I am new to both Quartz and Spring Batch. Is it possible to use Quartz as the single entry point for starting, stopping and restarting Spring Batch jobs? Quartz and Spring Batch each have their own set of tables for tracking jobs. Will there be any issues with using Quartz as the single entry point?
I want to read data from my DB (MySQL), do some processing and then write the result to Kafka.
Is it good practice to use Spring Batch for an infinite chunked step, i.e. to keep reading data from the database forever? (The database is active during batch processing, since it is the DB of a web app.)
Batch processing is about fixed, finite data sets. You seem to be looking for a streaming solution, which is out of scope of Spring Batch.
I have a Spring Batch solution which reads several tables in an Oracle database, does some flattening and cleaning of the data, and sends it to a RESTful API which is our BI platform. The Spring Batch job breaks this data down into chunks by date, not by size. It may happen that on a particular day one chunk consists of a million rows. We run the complete end-to-end flow in the following way:
Control-M sends a trigger to the load balancer at a scheduled time
Through the load balancer, the request lands on an instance of the Spring Batch app
Spring Batch reads the data for that day in chunks from the Oracle database
The chunks are then sent to the target API
My problems are:
The chunks can get heavy. If a chunk contains a million rows, the instance's heap usage grows and at some point the chunks are processed at a trickling pace
One instance bears the load of the entire batch processing
How can I distribute this processing across a group of instances? Is parallel processing achievable, and if so, how can I make sure that the same rows are not read by multiple instances (to avoid duplication)? Any other suggestions?
Thanks.
You can use a (locally or remotely) partitioned step where each worker step is assigned a distinct dataset. You can find more details and a code example in the documentation here:
https://docs.spring.io/spring-batch/docs/current/reference/html/spring-batch-integration.html#remote-partitioning
https://github.com/spring-projects/spring-batch/tree/main/spring-batch-samples#partitioning-sample
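For illustration, here is a minimal sketch of a locally partitioned step, assuming the Spring Batch 5 StepBuilder API; the partition key names, grid size and the worker step bean are placeholders, not taken from the samples above. To spread the workers across several app instances you would replace the local TaskExecutor with the remote-partitioning setup from the first link.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class PartitionedJobConfig {

    // Splits the day being loaded into non-overlapping sub-sets so that
    // no two workers ever read the same rows.
    @Bean
    public Partitioner dateRangePartitioner() {
        return gridSize -> {
            Map<String, ExecutionContext> partitions = new HashMap<>();
            LocalDate day = LocalDate.now();
            for (int i = 0; i < gridSize; i++) {
                ExecutionContext ctx = new ExecutionContext();
                ctx.putString("day", day.toString());
                ctx.putInt("partitionIndex", i); // e.g. used in a MOD(id, :gridSize) = :partitionIndex clause
                ctx.putInt("gridSize", gridSize);
                partitions.put("partition" + i, ctx);
            }
            return partitions;
        };
    }

    // Manager step: hands one ExecutionContext to each worker step execution.
    // The workerStep bean (a chunk-oriented step whose reader only selects its
    // own sub-set via the step execution context) is assumed to be defined elsewhere.
    @Bean
    public Step managerStep(JobRepository jobRepository, Step workerStep, Partitioner dateRangePartitioner) {
        return new StepBuilder("managerStep", jobRepository)
                .partitioner("workerStep", dateRangePartitioner)
                .step(workerStep)
                .gridSize(4)
                .taskExecutor(new SimpleAsyncTaskExecutor())
                .build();
    }
}
```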
Hi, I am new to Azure Data Factory and not at all familiar with the back-end processing that runs behind the scenes. I am wondering if there is a performance impact to running a couple of data flows in parallel compared to having all the transformations in one data flow.
I am trying to stage some data with a "not exists" transformation, and I have to do it for multiple tables. When I test-ran two data flows in parallel, the clusters were brought up simultaneously for both data flows. But I am not sure whether the better approach is to distribute the loading of the tables across a couple of data flows or to have all the transformations in one data flow.
1: If you execute data flows in a pipeline in parallel, ADF will spin up separate Spark clusters for each, based on the settings in the Azure Integration Runtime attached to each activity.
2: If you put all of your logic inside a single data flow, then it will all execute in that same job execution context on a single Spark cluster instance.
3: Another option is to execute the activities serially in the pipeline. If you have set a TTL on the Azure IR configuration, ADF will reuse the compute resources (VMs), but you will still get a brand-new Spark context for each execution.
All are valid practices and which one you choose should be driven by your requirements for your ETL process.
No. 3 will likely take the longest time to execute end-to-end. But it does provide a clean separation of operations in each data flow step.
No. 2 could be more difficult to follow logically and doesn't give you much re-usability.
No. 1 is really similar to No. 3, but you run them all in parallel. Of course, not every end-to-end process can run in parallel; you may require one data flow to finish before starting the next, in which case you're back in the serial mode of No. 3.
I have a requirement where some of my Quartz jobs should run in a clustered way (only one node out of three runs the job) and some jobs should run in a non-clustered way (all three nodes run the job).
My question is: can I use the same set of tables in a single data source for both of these requirements?
Here is what I can do to achieve this:
Two quartz.properties files, one for the clustered instance and one for the non-clustered one.
Both scheduler instances start at application startup.
The jobs configured under the non-clustered scheduler are then saved in the jobs table with the scheduler name NON_CLST_SCHE, i.e. in the same table but under a different scheduler name.
Is this the right way to use Quartz? Could we face any data corruption problems?
The Quartz documentation at http://www.quartz-scheduler.org/documentation/quartz-2.x/tutorials/tutorial-lesson-11.html says:
Never fire-up a non-clustered instance against the same set of tables that any other instance is running against. You may get serious data corruption, and will definitely experience erratic behavior.
Now, if the above warning applies, what is the way out for my requirement?
Any help is much appreciated, thanks in advance!
I think that your approach is fine (assuming that all non-clustered schedulers that access the same database have a unique scheduler name).
In my opinion, the warning refers to the case where you have multiple non-clustered instances with the same scheduler name running against the same database. A scheduler can only see jobs, triggers and so on in a JDBC JobStore if the scheduler name associated with that job (or trigger, ...) matches its own.
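To illustrate, a minimal sketch of the two property files under that assumption (the data source name and the QRTZ_ table prefix are placeholders; both schedulers point at the same set of tables):

```properties
# clustered-quartz.properties -- the three nodes share the work, only one node fires each job
org.quartz.scheduler.instanceName = CLST_SCHE
org.quartz.scheduler.instanceId = AUTO
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.dataSource = myDS
org.quartz.jobStore.tablePrefix = QRTZ_
org.quartz.jobStore.isClustered = true

# non-clustered-quartz.properties -- every node runs its own copy of these jobs;
# instanceName must be unique per node (e.g. NON_CLST_SCHE_NODE1, NON_CLST_SCHE_NODE2, ...)
# so the non-clustered schedulers never share state in the tables
org.quartz.scheduler.instanceName = NON_CLST_SCHE_NODE1
org.quartz.scheduler.instanceId = AUTO
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.dataSource = myDS
org.quartz.jobStore.tablePrefix = QRTZ_
org.quartz.jobStore.isClustered = false
```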
There is a Java process which fires a long-running database query to fetch a huge number of rows from the DB. These rows are then written to a file. The query cannot be processed in chunks, for various reasons.
I just wrapped the process in a Spring Batch tasklet and started the job.
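Roughly, the wrapping looks like this (a minimal sketch; ExportTasklet and the LegacyExporter helper are placeholder names for the existing code):

```java
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class ExportTasklet implements Tasklet {

    // the existing query-and-write-to-file code (placeholder type)
    private final LegacyExporter legacyExporter;

    public ExportTasklet(LegacyExporter legacyExporter) {
        this.legacyExporter = legacyExporter;
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        // runs the same long query and file write as the standalone Java process
        legacyExporter.runQueryAndWriteFile();
        return RepeatStatus.FINISHED;
    }
}
```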
I observed that the plain Java process is 4 times faster than the Spring Batch job. I am aware that the above scenario is not a good fit for Spring Batch, but I am curious why the process is slow when it is run as a tasklet.
[Edit] Recently I created another batch process which contains an ItemProcessor to validate each item against a set of data that has to be loaded before the job starts. I created a job listener to initialize this set from the Oracle DB. The set contains almost 0.2 million records, and reading them takes almost 1.5 hours. So I seriously suspect that Spring Batch has some limitation on reading a large amount of data from the DB in a single shot.