I am running a job on multiple nodes using Spring Batch, with the following dependencies:
implementation 'org.springframework.boot:spring-boot-starter-batch'
implementation 'org.springframework.boot:spring-boot-starter-actuator'
implementation 'io.micrometer:micrometer-registry-prometheus:1.3.2'
implementation 'io.prometheus:simpleclient_pushgateway:0.9.0'
By default, the host and the job name are used as the PushGateway grouping key.
While keeping those metrics, I also want to push them a second time with the host excluded from the grouping key, so that it contains only the job name.
For example, when A Job is executed, I want to end up with two groups in PushGateway:
Group1 - host=node1, jobName=A Job
Group2 - jobName=A Job
Is this possible, and what do you think of this approach?
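A minimal sketch of the two grouping keys described above, using only the JDK. The class name `GroupingKeys` and the method `forJob` are hypothetical. Each resulting map could then be passed as the `groupingKey` argument of `PushGateway.pushAdd(registry, job, groupingKey)` from `simpleclient_pushgateway`; that wiring is not shown here and is an assumption about how the pushes would be triggered.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: builds the grouping keys for the two pushes.
final class GroupingKeys {
    private GroupingKeys() {}

    // Returns the per-host key first, then the job-only key.
    static List<Map<String, String>> forJob(String host, String jobName) {
        Map<String, String> perHost = new LinkedHashMap<>();
        perHost.put("host", host);
        perHost.put("jobName", jobName);

        Map<String, String> jobOnly = new LinkedHashMap<>();
        jobOnly.put("jobName", jobName);

        List<Map<String, String>> keys = new ArrayList<>();
        keys.add(perHost);
        keys.add(jobOnly);
        return keys;
    }
}
```

One thing to keep in mind with the job-only group: every node pushes into the same group, so metrics with the same name pushed by one node replace those pushed earlier by another node.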
The documentation states that it is possible to schedule multiple jobs from within one Spark session/context. Can anyone give an example of how to do that? Can I launch several jobs/actions within a Future? Which ExecutionContext should I use? I'm not entirely sure how Spark manages this: how are the driver and the cluster made aware of the many jobs being submitted from within the same driver? Is there anything that signals this to Spark? If someone has an example, that would be great.
Motivation: My data is key-value based, and each group of records associated with a key has to be processed as a batch. In particular, I need to use mapPartitions, because in each partition I need to instantiate a non-serializable object to process my records.
(1) I could group records using Scala collections directly within each partition, and process each group as a batch.
(2) The other approach, which I am exploring, is to filter the data by key beforehand and launch an action/job for each filtered collection. That way there is no need to group within each partition, and I can process the whole partition as a batch directly. I am assuming that the fair scheduler would do a good job of scheduling work evenly between the jobs. If the fair scheduler works well, I think this solution is more efficient. However, I need to test it, so I wonder if someone could help with how to achieve threading within a Spark session, and warn me of any downsides.
Moreover, if anyone has had to choose between the two approaches, what was the outcome?
Note: This is a streaming application. Each group of records associated with a key needs a specifically configured instance of an object in order to be processed (imperatively, as a batch). Because that object is non-serializable, it needs to be instantiated per partition.
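To make approach (2) more concrete, here is a minimal JDK-only sketch of submitting one unit of work per key from separate threads. Spark's scheduler accepts actions submitted concurrently from different threads of the same driver, so in a real driver the body of each `Callable` would call an action on the data filtered by that key. The class `ConcurrentActions`, the method `runPerKey` and the placeholder result string are all hypothetical; no Spark API is used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: one thread-submitted "action" per key.
final class ConcurrentActions {
    private ConcurrentActions() {}

    // Runs one action per key on a thread pool and returns the results in key order.
    static List<String> runPerKey(List<String> keys, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String key : keys) {
                // Placeholder body: a Spark driver would trigger an action
                // on the records filtered by `key` here.
                Callable<String> action = () -> "processed:" + key;
                futures.add(pool.submit(action));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                try {
                    results.add(future.get()); // blocks until that action finishes
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

For the jobs to share resources rather than run first-in-first-out, the application would also need `spark.scheduler.mode` set to `FAIR`.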
Is it possible to use Spring Batch as a regular job framework?
I want to create a device service (microservice) that is responsible for receiving events and triggering jobs on devices. The devices are remote, so a job will take time to complete, but it is not a batch job (it does not run periodically or partition a large data set).
I am wondering whether Spring Batch can still be used as a job framework, or whether it is only for batch processing. If the answer is no, which job frameworks (besides writing your own) are well known?
Job Description:
I need to execute a job consisting of several steps against a specific device. Each step will communicate with the device and wait for it to confirm that it executed the previous command given to it.
I need retry, recovery and scheduling features (I thought of combining Spring Batch with Quartz).
Regarding read-process-write: I basically receive a command request for a device, do a few DB reads and then start long waiting periods, all of which need to pass for the job/task to be successful.
Also, I can choose (and justify) a relevant IMDG/DB. Concurrency is out of scope (it will be handled outside the job mechanism). An alternative that came to mind was Akka actors (the job for a device would create child actors as steps).
As far as I know, running periodically or partitioning a large data set is not a primary requirement for using Spring Batch.
Spring Batch is basically a read-process-write framework where reading and processing happen item by item and writing happens in chunks (for chunk-oriented processing).
So you can use Spring Batch if your job logic fits into the read-process-write paradigm; the rest seems secondary to me.
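To illustrate the paradigm, here is a simplified JDK-only sketch of a chunk-oriented loop: items are read and processed one at a time, and written once a chunk is full. This is not Spring Batch's implementation or API; `ChunkLoop` and its parameters are hypothetical stand-ins for a reader, a processor and a writer.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical sketch of read-process-write with chunked writes.
final class ChunkLoop {
    private ChunkLoop() {}

    static <I, O> void run(Iterator<I> reader, Function<I, O> processor,
                           Consumer<List<O>> writer, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<O> chunk = new ArrayList<>(chunkSize);
        while (reader.hasNext()) {
            // Read and process item by item.
            chunk.add(processor.apply(reader.next()));
            if (chunk.size() == chunkSize) {
                writer.accept(chunk); // write a full chunk at once
                chunk = new ArrayList<>(chunkSize);
            }
        }
        if (!chunk.isEmpty()) {
            writer.accept(chunk); // write the final, partially filled chunk
        }
    }
}
```

If your device job does not naturally look like this loop (for example, if it is mostly waiting for confirmations), that is a hint that the paradigm may be a poor fit.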
With Spring Batch, you should also evaluate the Job Repository. Spring Batch needs a database (either in memory or on disk) to store job metadata, and it's not optional.
I think you should explain in more detail why you need a job framework and what kind of logic you are running that you call a job, so that I can revise my answer accordingly.
I already saw the question "How to implement custom job listener/tracker in Spark?" and checked the source code to find out how to get the number of stages per job, but is there any way to programmatically track the percentage of jobs that have completed in a Spark app?
I can probably get the number of finished jobs with the listeners, but I'm missing the total number of jobs that will be run.
I want to track the progress of the whole app, which creates quite a few jobs, but I can't find the total anywhere.
#Edit: I know there's a REST endpoint for getting all the jobs in an app but:
I would prefer not to use REST but to get it from within the app itself (Spark is running on AWS EMR/YARN; getting the address is probably doable, but I'd prefer not to).
That REST endpoint seems to return only jobs that are running, finished or failed, not the total number of jobs.
After going through the source code a bit, I guess there's no way to see upfront how many jobs there will be, since I couldn't find any place where Spark does such an analysis upfront (jobs are submitted independently by each action, so Spark doesn't have the big picture of all the jobs from the start).
This kind of makes sense because of how Spark divides work into:
jobs - which are started whenever the code which is run on the driver node encounters an action (e.g. collect(), take() etc.) and are supposed to compute a value and return it to the driver
stages - which are composed of sequences of tasks between which no data shuffling is required
tasks - computations of the same type which can run in parallel on worker nodes
So we do need to know stages and tasks upfront for a single job to create the DAG but we don't necessarily need to create a DAG of jobs, we can just create them "as we go".
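Given that, one workaround is to have the application itself supply the expected number of jobs and count completions against it. Below is a minimal JDK-only sketch under that assumption. The class `JobProgress` is hypothetical, and the expected total is an estimate the app has to provide. In a Spark app, `jobFinished()` could be called from a `SparkListener`'s `onJobEnd` callback.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical tracker: the expected job count comes from the application, not from Spark.
final class JobProgress {
    private final int expectedJobs;
    private final AtomicInteger finishedJobs = new AtomicInteger();

    JobProgress(int expectedJobs) {
        if (expectedJobs <= 0) {
            throw new IllegalArgumentException("expectedJobs must be positive");
        }
        this.expectedJobs = expectedJobs;
    }

    // Call once per finished job; safe to call from multiple threads.
    void jobFinished() {
        finishedJobs.incrementAndGet();
    }

    // Percentage of expected jobs finished so far, capped at 100
    // in case the estimate was too low.
    double percentComplete() {
        return Math.min(100.0, 100.0 * finishedJobs.get() / expectedJobs);
    }
}
```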
Here is what I'm trying to achieve in a Spring Batch job:
A partitioner launches a FlowStep
The FlowStep consists of n step(s)
In case of failure, I want a consistent restart of the inner steps
I encounter the following issue during a restart:
Suppose I have 2 partitions and, for the sake of simplicity, a SyncTaskExecutor. The first partition (partition0) runs well, and the second partition (partition1) runs next.
The first problem is that the sub-steps of the FlowStep are detected as duplicates, because the names of the sub-steps are not suffixed with the partition index. The steps do ultimately run, however.
The consequence shows up if one sub-step fails. In that case, during a restart, since all sub-steps of the partition0 execution exited successfully, the remaining steps of partition1 won't be executed.
The main problem here is that the sub-steps of a partitioner are not indexed and are therefore detected as equivalent, although they are not.
Additionally, I don't want to set the sub-steps as restartable, because I want only the missing steps to be executed, not all of them.
Am I missing something at this point? Do you have an alternative for what I want to do?
I know I could also launch a real job from the partitioner (using a JobStep), but this is not as powerful as a FlowStep because we are quite limited in the parameters we can provide to a job (no existing ExecutionContext). I guess the author of "Spring batch Partitioning with multiple steps in parallel?" had the same issue.
Thank you for your help
After digging into the Spring Batch internals, I think I can answer my own question and maybe help some other people.
The key here is to provide our own StepHandler instead of the default SimpleStepHandler. In this handler, we can use the provided ExecutionContext to look up a predefined key that contains the current partition id. We just need to use this id to build a unique step name of the form step.getName() + ":" + id.
In order to insert this custom StepHandler, we override the default FlowStep implementation.
A complete example can be found here https://github.com/miremond/spring-boot-sample-batch.
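The naming rule itself can be sketched with the JDK alone. This is not the code from the linked repository: the class `PartitionStepNames` and the key name `partitionId` are hypothetical, and a plain `Map` stands in for Spring Batch's ExecutionContext.

```java
import java.util.Map;

// Hypothetical sketch of the naming rule used by the custom StepHandler.
final class PartitionStepNames {
    // Assumed key under which the partitioner stored the partition id.
    static final String PARTITION_ID_KEY = "partitionId";

    private PartitionStepNames() {}

    // Appends ":" + id to the step name when a partition id is present;
    // otherwise returns the step name unchanged.
    static String uniqueName(String stepName, Map<String, Object> context) {
        Object id = context.get(PARTITION_ID_KEY);
        return id == null ? stepName : stepName + ":" + id;
    }
}
```

With names built this way, the sub-steps of partition0 and partition1 are recorded under different step names, so a restart can tell them apart.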
I came across the Spark feature that allows you to schedule different tasks within a Spark context.
I want to use this feature in a program where I map my input RDD (from a text source) into a key-value RDD [K,V], and subsequently build a composite-key RDD [(K1,K2),V] and a filtered RDD containing some specific values.
The rest of the pipeline involves calling some statistical methods from MLlib on both RDDs, followed by a join operation and writing the result to disk.
I am trying to understand how Spark's internal fair scheduler will handle these operations. I tried reading the job scheduling documentation but got more confused by the concepts of pools, users and tasks.
What exactly are the pools? Are they certain 'tasks' that can be grouped together, or are they Linux users pooled into a group?
What are users in this context? Do they refer to threads, or is it something like SQL context queries?
I guess it relates to how tasks are scheduled within a Spark context, but reading the documentation makes it seem like we are dealing with multiple applications with different clients and user groups.
Can someone please clarify this?
The whole pipeline you described in paragraph 2:
map -> map -> map -> filter
will be handled in a single stage, just like a map() in MapReduce, if that is familiar to you. This is because there is no need to repartition or shuffle your data, since you make no requirements on the correlation between records. Spark chains as many transformations as possible into the same stage before creating a new one, because that is much more lightweight. More information on stage separation can be found in the paper Resilient Distributed Datasets, Section 5.1, Job Scheduling.
When the stage gets executed, it is one task set (the same task running in different threads), and from Spark's perspective its tasks are scheduled simultaneously.
The fair scheduler is meant for scheduling unrelated task sets, so it is not applicable here.