Spring Batch - writing in-memory in a step - spring-batch

I have a Spring batch job which has two steps. The first step's writer writes to memory, that is stores the data in a java data structure.
Is this correct? Does the writer have to write to a persistent storage? If the second step fails, would the job be able to restart correctly if I wrote to the memory in the first step? Is my assumption that a commit doesn't mean anything if I do things this way correct?

A writer does not have to write to a persistent storage. However, If the job fails and the JVM is stopped, you will lose that data.
Using a persistent job repository ensures that restart meta-data can survive a JVM crash hence the ability to restart the job where it left off.

Related

Zombie giant unkillable task blocks Druid at restart

I'm running Apache Druid 0.17 deploying with nohup ./bin/start-nano-quickstart > mylog.log. As the deep storage I am using s3 and I have parquet extension enabled and all work fine. I could ingest with several small spark partitioned parquet datasources from s3 correctly. All the remaining configurations are untouched.
As I tried loading a giant datasource to test the performance and resource usage the task died after a couple of hours because of OutOfMemory.(It was expected)
2020-02-07T17:32:20,519 INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver - New segment[arc_2016-09-29T12:00:00.000Z_2016-09-29T13:00:00.000Z_2020-02-07T17:22:45.965Z] for sequenceName[index_parallel_arc_chgindko_2020-02-07T14:59:32.337Z].
Terminating due to java.lang.OutOfMemoryError: GC overhead limit exceeded
Now every time I restart Druid, it starts that giant task and it is impossible to kill it. Even when the task apparently dies or turns in waiting status the CPU usage is about 140% and I cannot submit new tasks to Druid. I tried to access the Derby database manually to find the task and remove it but I was not successful and this solution is really nasty. I know that I can change the database in the configuration so the next time I will have a fresh Druid but it is not a good solution as I will miss all other datasources. How can I get ready of this long running zombie task?

How to restart failed Chunks - Spring batch

I am writing Spring batch for chunk processing.
I have Single Job and Single Step. Within that Step I have Chunks which are dynamically sized.
SingleJob -> SingleStep -> chunk_1..chunk_2..chunk_3...so on
Following is a case I am trying to implement,
If Today I ran a Job and only chunk_2 failed and rest of chunks ran successfully. Now Tomorrow I want to run/restart ONLY failed chunks i.e. in this case chunk_2. (I don't want to run whole Job/Step/Other successfully completed Chunks)
I see Spring batch allow to store metadata and using that it helps to restart Jobs. but I did not get if it is possible to restart specific chunk as discuss above.
Am I missing any concept or if it is possible then any pseudo code/theoretical explanation or reference will help.
I appreciate your response
That's how Spring Batch works in a restart scenario, it will continue where it left off in the previous failed run.
So in your example, if the in the first run chunk1 has been correctly processed and chunk2 failed, the next job execution will restart at chunk2.

Intermittant file not found using Google Cloud Storage from Dataproc - flushing writes?

I have a series of dataproc jobs that run to import some data received each morning. The process creates a cluster, runs four jobs in sequence, then shuts down the cluster. The input file is read from Google Cloud Storage, and the intermediate results are also saved in Avro form in GCS with the final output going to Cloud SQL.
Fairly often the jobs will fail trying to read the Avro written by the previous job. It appears that GCS hasn't "caught up" and the results from the previous job haven't been fully written. I was getting failures trying to read files that appeared to be from the previous day's run and partway through those files would disappear and be replaced by the new ones. I have changed my script that runs the files to clear the work area before starting the jobs, but still have problems where sometimes it starts reading and all the parts haven't been written fully.
I could change the code to simply store the intermediate files on the cluster, tho I like having them available outside for diagnosing other problems. I could also just write to both locations with the cluster for working and GCS for diagnostics.
But assuming this is some kind of sync issue, is there a way to force GCS to flush writes / be fully synced between jobs? Or is there some check I can do to make sure everything has been written before starting the next job in my chain?
EDIT: To answer the comment below, the sequence of jobs all run on the same cluster. The cluster is started, each job run in turn on that cluster, and then the cluster is shut down.
For now, I have worked around this by having the jobs write to HDFS on the cluster in addition to GCS, and the subsequent jobs reading from the cluster. The GCS output is now strictly for diagnostics in case of a problem. But even tho my immediate problem is (I believe) fixed I still would like to know what's happening and why GCS seems out of sync for a bit.

Launch spark job on-demand from code

What is the recommended way to launch a Spark job on-demand from within an enterprise application (in Java or Scala)? There is a processing step which currently takes several minutes to complete. I would like to use a Spark cluster to reduce the processing down to, let's say less than 15 seconds:
Rewrite the time consuming process in Spark and Scala.
The parameters would be passed to the JAR as command line arguments. The Spark job then acquires source data from a database. Do the processing and save the output in a location readable by the enterprise application.
Question 1: How to launch the Spark job on-demand from within the enterprise application? The Spark cluster (standalone) is on the same LAN but separate from the servers on which the enterprise app is running.
Question 2: What is the recommended way to transmit the processing results back to the caller code?
Question 3: How to notify the caller code about job completion (or failure such as Spark cluster down, job time out, exception in spark code)
You could try spark-jobserver . Upload your spark.jar to the server. And from your application, you can call the job in your spark.jar using the rest interface . To know whether your job is completed or not , you can keep polling the rest interface. And when your job completes and if the result is very small you could get it from the rest interface itself. But if the result is huge it is better to save to some db.

Jobs removed from registry when Spring XD is killed

I'm running Spring XD as single-node for my Sandbox environment with a MySQL DB for the batch tables. If I kill -15 the Spring XD process, then all the current definitions for my jobs and streams are lost (in the case of the jobs, the XD_JOB_REGISTRY is apparently deleted). Consequently, if I start up Spring XD again, I have lost all the previous jobs and streams definitions.
I would like to know whether this is intentional in Spring XD, or maybe due to the fact that I run in single-node mode? Or is it a bug?
EDITED TO ADD THE GIST OF SERVERS.YML:
https://gist.github.com/emedina/486b52f11bc146203534
The job and stream definitions are stored in Zookeeper while the stats for any executed jobs are stored in the database. The single-node server uses an embedded Zookeeper instance by default and that's my guess why your definitions are gone when restarting. Try setting up a separate Zookeeper instance with a permanent data location.