I have a Spark job that is submitted periodically (every hour), and I need to store its completion status in a database whenever an action (for example, filtering and writing) is complete.
What is the best way to get job progress (from the Spark stages) and store its progress, completion, or any error?
I thought of using HBase or some other NoSQL store, but for this simple information HBase or another database would be overkill. It also looks like there is no SQLite support with Spark, so what is the best way to store this information?
I need the following information to be put in the DB:
Current job running, its progress, its input/output paths, etc.
Job completion, pending, or failure status, etc.
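For illustration, the kind of hook I have in mind is a SparkListener along these lines (just a sketch; the println calls stand in for whatever database write I end up choosing):

```java
import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerJobEnd;
import org.apache.spark.scheduler.SparkListenerJobStart;
import org.apache.spark.scheduler.SparkListenerStageCompleted;

public class JobStatusListener extends SparkListener {

    @Override
    public void onJobStart(SparkListenerJobStart jobStart) {
        // Record that the job started and how many stages it will run.
        System.out.println("Job " + jobStart.jobId() + " started with "
                + jobStart.stageInfos().size() + " stages");
    }

    @Override
    public void onStageCompleted(SparkListenerStageCompleted stageCompleted) {
        // Coarse-grained progress: one update per completed stage.
        System.out.println("Stage " + stageCompleted.stageInfo().stageId()
                + " (" + stageCompleted.stageInfo().name() + ") completed");
    }

    @Override
    public void onJobEnd(SparkListenerJobEnd jobEnd) {
        // jobResult() is JobSucceeded or JobFailed; this is where the
        // completion/failure status would be persisted.
        System.out.println("Job " + jobEnd.jobId() + " finished: " + jobEnd.jobResult());
    }
}
```

The listener would be registered with sparkContext.addSparkListener(new JobStatusListener()) (or via the spark.extraListeners setting), so the status writes don't have to be sprinkled through the job code itself.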
Related
We have a requirement to process millions of records using Spring Batch. We plan to do this by reading the DB with JdbcPagingItemReaderBuilder, processing in chunks, and writing to a Kafka queue. The active consumers of the queue will process the chunks of data and update the DB.
The consumer's task is to iterate over every item in the chunk and invoke the external APIs.
If the external system is down or does not respond with a success response, there should be at least 3 retries. Considering that each task in the chunk has to do this, what would be the ideal approach?
Another use case to consider: what happens when the system goes down while the job is processing, say after the job has already processed 10,000 records with the rest still to be processed? After a restart, how do we make sure the execution doesn't redo the entire process from the beginning but resumes from the point of failure?
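For illustration, the kind of step we have in mind is roughly the sketch below (assuming Spring Batch 5's StepBuilder; the reader/processor/writer beans, the ApiRecord type, and the retried exception class are placeholders):

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;
import org.springframework.web.client.ResourceAccessException;

@Configuration
public class ExternalApiStepConfig {

    // Placeholder item type for whatever the paging reader maps rows to.
    public record ApiRecord(long id) {}

    @Bean
    public Step externalApiStep(JobRepository jobRepository,
                                PlatformTransactionManager transactionManager,
                                ItemReader<ApiRecord> pagingReader,
                                ItemProcessor<ApiRecord, ApiRecord> apiCallingProcessor,
                                ItemWriter<ApiRecord> kafkaWriter) {
        return new StepBuilder("externalApiStep", jobRepository)
                .<ApiRecord, ApiRecord>chunk(100, transactionManager)
                .reader(pagingReader)               // e.g. built with JdbcPagingItemReaderBuilder
                .processor(apiCallingProcessor)     // the piece that invokes the external API
                .writer(kafkaWriter)                // e.g. a KafkaItemWriter
                .faultTolerant()
                .retry(ResourceAccessException.class)  // transient "external system down" failures
                .retryLimit(3)
                .build();
    }
}
```

On the restart question: Spring Batch commits the reader's position to its metadata tables with every chunk, so restarting the same job instance after a crash resumes from the last committed chunk rather than from the beginning (as long as the job and reader are left restartable).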
Spring Batch creates the following tables. You can use them to check the status of your job and customize your scheduler to behave however you see fit.
I'd use the step execution ID in BATCH_STEP_EXECUTION to check the status that's been set and then retry based on that status, or something along those lines (a query sketch follows the table list).
BATCH_JOB_EXECUTION
BATCH_JOB_EXECUTION_CONTEXT
BATCH_JOB_EXECUTION_PARAMS
BATCH_JOB_INSTANCE
BATCH_STEP_EXECUTION
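For example, a rough sketch of reading the status straight from BATCH_STEP_EXECUTION with a plain JdbcTemplate (adjust the table name if you've configured a table prefix):

```java
import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class StepStatusChecker {

    private final JdbcTemplate jdbcTemplate;

    public StepStatusChecker(DataSource batchMetadataDataSource) {
        this.jdbcTemplate = new JdbcTemplate(batchMetadataDataSource);
    }

    // Returns e.g. STARTED, COMPLETED or FAILED for the given step execution.
    public String statusOf(long stepExecutionId) {
        return jdbcTemplate.queryForObject(
                "SELECT STATUS FROM BATCH_STEP_EXECUTION WHERE STEP_EXECUTION_ID = ?",
                String.class, stepExecutionId);
    }
}
```

If you'd rather not query the tables directly, the JobExplorer API exposes the same job and step execution information programmatically.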
I import some data into my Druid datasource. For that, I use NiFi and Tranquility for streaming ingestion with minute granularity (for my tests).
I use Ambari to check all my tasks and their status.
All my data are imported into my datasource correctly, and I can query them with Hive.
When I look at my tasks in Ambari, all of them are running; they are never "Complete". If I want to complete one of them, I have to kill it, but then I lose my data and the task status is "FAILED".
I would like to understand what I can do to complete my tasks successfully.
Thanks.
I found the problem.
In my Tranquility conf, I had declared a large value for the "WindowPeriod".
In fact, the task automatically ends when the "WindowPeriod" ends.
For example, "WindowPeriod":"PT10M" means that the task will end after 10 minutes.
Glad you figured it out! Just want to call out for anyone reading this that Tranquility is deprecated. Streaming ingestion services such as https://druid.apache.org/docs/latest/development/extensions-core/kafka-ingestion should be preferred for anyone starting a new deployment.
Is it possible to use Spring Batch as a regular job framework?
I want to create a device service (microservice) that is responsible for getting events and triggering jobs on devices. The devices are remote, so it will take time for a job to complete, but it is not a batch job (not periodically running, not partitioning a large data set).
I am wondering whether Spring Batch can still be used as a job framework, or if it is only for batch processing. If the answer is no, what job frameworks (besides writing your own) are well known?
Job Description:
I need to execute, against a specific device, a job that contains several steps. Each step will communicate with the device and wait for it to confirm that it executed the previous command.
I need retry, recovery, and scheduling features (I thought of combining Spring Batch with Quartz).
Regarding read-process-write: I basically get a command request for a device, do a few DB reads, and then start long waiting periods that all need to pass for the job/task to succeed.
Also, I can choose (and justify) a relevant IMDG/DB. Concurrency is out of scope (it will be handled outside the job mechanism). An alternative that came to mind was Akka actors (a job for a device would create child actors as its steps).
As far as I know, periodic runs and partitioning of large data sets are not prerequisites for using Spring Batch.
Spring Batch is basically a read-process-write framework where reading and processing happen item by item and writing happens in chunks (for chunk-oriented processing).
So you can use Spring Batch if your job logic fits the read-process-write paradigm; the rest seems secondary to me.
Also, with Spring Batch you should evaluate the job repository: Spring Batch needs a database (either in-memory or on disk) to store job metadata, and it is not optional.
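A minimal sketch of that mandatory backing store, assuming plain Java configuration and the H2 schema script that ships with Spring Batch:

```java
import javax.sql.DataSource;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.datasource.DataSourceTransactionManager;
import org.springframework.jdbc.datasource.embedded.EmbeddedDatabaseBuilder;
import org.springframework.jdbc.datasource.embedded.EmbeddedDatabaseType;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
@EnableBatchProcessing
public class BatchRepositoryConfig {

    @Bean
    public DataSource dataSource() {
        // In-memory H2 database holding the BATCH_* metadata tables;
        // swap in a persistent DataSource if job state must survive restarts.
        return new EmbeddedDatabaseBuilder()
                .setType(EmbeddedDatabaseType.H2)
                .addScript("classpath:org/springframework/batch/core/schema-h2.sql")
                .build();
    }

    @Bean
    public PlatformTransactionManager transactionManager(DataSource dataSource) {
        return new DataSourceTransactionManager(dataSource);
    }
}
```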
I think you should add more explanation of why you need a job framework and what kind of logic you are running that you call a job, and I will revise my answer accordingly.
I'm looking to impose a timeout on custom activities in Data Factory via the policy.timeout property in the activity JSON.
However, I haven't seen any documentation on how the timeout operates with Azure Batch. I assume the Batch task is forcibly terminated somehow.
But is the task -> custom activity informed, so it can tidy up?
The reason I ask is that I could be mid-copy to Data Lake Store, and I want neither to let it run indefinitely nor to stop it without some sort of cleanup (I can't see a way of doing transactions as such using the Data Lake Store SDK).
I'm considering putting the timeout within the custom activity, but it would be a shame to have timeouts defined at two different levels (I'd probably still want the overall timeout).
I feel your pain.
ADF simply terminates the activity when its own timeout is reached, regardless of what state the invoked service is in.
I have the same issue with U-SQL processing calls. It takes a lot of proactive monitoring via PowerShell to ensure Data Lake or Batch jobs have enough compute to complete, given naturally increasing data volumes, before the ADF timeout kill occurs.
I'm not aware of any graceful way for ADF to handle this because it would differ for each activity type.
Time to create another feedback article for Microsoft!
I have one process that gets stuck at the same point. The only information I have is the task's Index on the Details page (referring to the dashboard UI).
How can I debug/log exactly the task at that specific index?
Based on the answer in:
How to get ID of a map task in Spark?
I can see how to get the task info, but how do the IDs in the UI dashboard map to the fields of that object?
Is ID = org.apache.spark.scheduler.TaskInfo.id and Index = org.apache.spark.scheduler.TaskInfo.partitionId?
The IDs in the dashboard refer to partitions in Spark. Whenever a job is launched, your input data is partitioned, and depending on the number of partitions, you'll have them mapped to task IDs.
It's not trivial to debug Spark jobs, as they are map-reduce tasks over your data performed by your algorithm. It's fairly easy, though, to add logs to debug your job after the fact. The logs have to be collected on the workers, or from each executor's working directory.
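For example, a listener sketch along these lines logs, for every finished task, the identifiers that should line up with the UI columns (ID being the unique task ID, Index the partition index):

```java
import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerTaskEnd;
import org.apache.spark.scheduler.TaskInfo;

public class TaskIdLoggingListener extends SparkListener {

    @Override
    public void onTaskEnd(SparkListenerTaskEnd taskEnd) {
        TaskInfo info = taskEnd.taskInfo();
        // info.taskId() should match the "ID" column in the UI;
        // info.index() is the partition index shown as "Index".
        System.out.println("stage=" + taskEnd.stageId()
                + " taskId=" + info.taskId()
                + " index=" + info.index()
                + " attempt=" + info.attemptNumber()
                + " host=" + info.host());
    }
}
```

Registered via sparkContext.addSparkListener(...), this lets you correlate a stuck index in the dashboard with a specific host, and from there with the right executor log.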