I ran into a problem and could not find a good answer to the question it raised: how can I purposely make an AWS EMR step fail?
I have a Spark Scala script that is added as a Spark step with some command line arguments, and the output of the script is written to S3.
But if something goes wrong while reading and handling the command line arguments, the main logic of the script is skipped and the script simply ends. To EMR this looks like normal behaviour; it has no way of knowing that an if block was never entered.
So after such a "failed" run, the step status is still set to "Completed", and the step looks successful even though no results were written to S3.
I want the step to finish in the "Failed" status.
I can do that by throwing an exception, and then I can see the corresponding exception with my message in the EMR step error logs. But is there a better way? I would prefer to handle all my exceptions myself, manually.
And in addition, can I use the AWS SDK to programmatically find out the reason for a step's failure?
Return a non-zero value from your program
To intentionally fail the EMR step, you can always add a bit of logic that is guaranteed to fail at runtime.
For PySpark we insert a line such as a = 5 / 0, which raises an exception and fails the code.
Alternatively, you can pass in something like an S3 path that doesn't exist. This will also fail the job at runtime.
OR
You can call exit(1) to return a non-zero value from your code; EMR marks a step as failed when the underlying process exits with a non-zero code.
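This combines well with handling the exceptions yourself. A minimal sketch of the idea in Scala; the argument check and runJob are placeholders for your own validation and Spark logic:

object MyEmrJob {
  def main(args: Array[String]): Unit = {
    try {
      if (args.length < 2)
        throw new IllegalArgumentException(s"expected 2 arguments, got ${args.length}")
      runJob(args(0), args(1)) // your actual Spark logic, writing to S3
    } catch {
      case e: Exception =>
        // Log the reason yourself, then exit non-zero so EMR marks the step as Failed.
        System.err.println(s"Step failed: ${e.getMessage}")
        sys.exit(1)
    }
  }

  private def runJob(input: String, output: String): Unit = {
    // ... read input, transform, write results to S3 ...
  }
}

As for the follow-up question: the EMR DescribeStep API returns the step status, which for failed steps includes a failure-details structure. A sketch using the AWS SDK for Java v2, which is callable from Scala; the cluster and step IDs are placeholders:

import software.amazon.awssdk.services.emr.EmrClient
import software.amazon.awssdk.services.emr.model.DescribeStepRequest

def printFailureReason(clusterId: String, stepId: String): Unit = {
  val emr = EmrClient.create()
  val status = emr.describeStep(
    DescribeStepRequest.builder()
      .clusterId(clusterId) // e.g. "j-XXXXXXXXXXXXX"
      .stepId(stepId)       // e.g. "s-XXXXXXXXXXXXX"
      .build()
  ).step().status()

  println(status.state()) // e.g. FAILED
  // For failed steps, failureDetails() carries a reason, a message,
  // and a pointer to the relevant log file.
  Option(status.failureDetails()).foreach { fd =>
    println(s"${fd.reason()} / ${fd.message()} / ${fd.logFile()}")
  }
}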
Related
Using the Confluent Cloud API endpoint, I send an HTTP request containing a CREATE OR REPLACE STREAM command that is used to update an already existing stream. About half of the time, however, ksqlDB returns a 502 error:
Could not write the statement 'CREATE OR REPLACE STREAM...
Caused by: Another command with the same id
It seems this is a well-known issue that happens when a ksqlDB command fails to commit. It has been "recommended" to send the same command again and again, "...as a subsequent retry of the command should work fine".
Another suggestion is to "clean up the command from the futures map on a commit failure".
Here is a GitHub issue link: https://github.com/confluentinc/ksql/issues/6340
The first recommendation, re-sending the same command multiple times in a row until it finally works, does not seem like a good solution. I wonder if the second recommendation, cleaning up the command from the futures map, is a better choice?
How do we clean up the command from the futures map? Is there any other known way to make ksqlDB execute the ksql command reliably?
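For what it's worth, if you do have to live with the retry recommendation while the linked issue is open, a bounded retry with backoff at least keeps it controlled. A minimal sketch in Scala against the ksqlDB REST /ksql endpoint; the URL, statement body, and retry limits are placeholders for your own setup:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object KsqlSubmitWithRetry {
  // Placeholder endpoint and statement; adjust to your cluster.
  val ksqlUrl   = "https://<ksqldb-host>:8088/ksql"
  val statement = """{"ksql": "CREATE OR REPLACE STREAM my_stream ...;", "streamsProperties": {}}"""

  private val client = HttpClient.newHttpClient()

  private def submit(): HttpResponse[String] = {
    val request = HttpRequest.newBuilder()
      .uri(URI.create(ksqlUrl))
      .header("Content-Type", "application/vnd.ksql.v1+json; charset=utf-8")
      .POST(HttpRequest.BodyPublishers.ofString(statement))
      .build()
    client.send(request, HttpResponse.BodyHandlers.ofString())
  }

  def main(args: Array[String]): Unit = {
    // Retry only on 502, with a linear backoff and a hard attempt limit.
    val maxAttempts = 5
    var attempt = 0
    var done = false
    while (!done && attempt < maxAttempts) {
      attempt += 1
      val response = submit()
      if (response.statusCode() != 502) {
        println(s"ksqlDB answered ${response.statusCode()}: ${response.body()}")
        done = true
      } else Thread.sleep(1000L * attempt)
    }
    if (!done) sys.error(s"ksqlDB still returned 502 after $maxAttempts attempts")
  }
}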
Can I restart a job and process only the skipped items after I have corrected the mistakes in the file? I'm reading the documentation and am not finding this possibility. You can restart a job if it has failed, but I'm thinking of restarting a job after it has completed with some skipped items. If this cannot be achieved with configuration, what would be a good way to implement it myself?
What I have done in a case similar to yours is to log each skipped item to a file.
Then I created a second job that loads the file and processes all the logged items.
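A minimal sketch of the logging half of that idea, using Spring Batch's SkipListener callback and assuming String items; the file path and line format are placeholders:

import java.nio.file.{Files, Paths, StandardOpenOption}
import org.springframework.batch.core.SkipListener

class SkippedItemLogger extends SkipListener[String, String] {

  private val skipFile = Paths.get("/tmp/skipped-items.log") // placeholder path

  private def append(line: String): Unit =
    Files.write(skipFile, (line + System.lineSeparator()).getBytes,
      StandardOpenOption.CREATE, StandardOpenOption.APPEND)

  // The item could not even be read, so log the cause instead of the item.
  override def onSkipInRead(t: Throwable): Unit =
    append(s"READ_FAILURE: ${t.getMessage}")

  override def onSkipInProcess(item: String, t: Throwable): Unit = append(item)

  override def onSkipInWrite(item: String, t: Throwable): Unit = append(item)
}

Register the listener on the fault-tolerant step, and point the second job's reader at the log file.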
I am writing a Spring Batch job for chunk processing.
I have a single Job with a single Step. Within that Step I have chunks which are dynamically sized.
SingleJob -> SingleStep -> chunk_1..chunk_2..chunk_3...so on
The following is the case I am trying to implement:
Today I ran the Job and only chunk_2 failed while the rest of the chunks ran successfully. Tomorrow I want to run/restart ONLY the failed chunk, i.e. chunk_2 in this case. (I don't want to rerun the whole Job/Step or the other successfully completed chunks.)
I see that Spring Batch stores metadata and uses it to restart Jobs, but I could not find out whether it is possible to restart a specific chunk as discussed above.
Am I missing a concept? If this is possible, any pseudo code, theoretical explanation, or reference will help.
I appreciate your response.
That's how Spring Batch works in a restart scenario: it continues where it left off in the previous failed run.
So in your example, if chunk_1 was processed correctly in the first run and chunk_2 failed, the next job execution will restart at chunk_2.
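A minimal sketch of what triggers that behaviour, assuming a restartable job with a stateful ItemReader; the launcher, job, and parameter names are placeholders:

import org.springframework.batch.core.{Job, JobParametersBuilder}
import org.springframework.batch.core.launch.JobLauncher

def runImport(jobLauncher: JobLauncher, importJob: Job): Unit = {
  // Identifying parameters select the JobInstance. Launching again with the
  // SAME identifying parameters after a failure resumes that instance, and
  // Spring Batch skips the chunks already committed by the failed execution.
  val params = new JobParametersBuilder()
    .addString("inputFile", "data/2024-01-01.csv") // placeholder parameter
    .toJobParameters

  jobLauncher.run(importJob, params) // day 1: fails at chunk_2
  // ... fix the underlying problem ...
  jobLauncher.run(importJob, params) // day 2: resumes at chunk_2
}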
I have a series of Dataproc jobs that run to import some data received each morning. The process creates a cluster, runs four jobs in sequence, then shuts down the cluster. The input file is read from Google Cloud Storage, the intermediate results are also saved in Avro form in GCS, and the final output goes to Cloud SQL.
Fairly often the jobs will fail trying to read the Avro written by the previous job. It appears that GCS hasn't "caught up" and the results from the previous job haven't been fully written. I was getting failures trying to read files that appeared to be from the previous day's run; partway through, those files would disappear and be replaced by the new ones. I have changed the script that runs the jobs to clear the work area before starting, but I still have problems where reading sometimes starts before all the parts have been fully written.
I could change the code to simply store the intermediate files on the cluster, though I like having them available outside the cluster for diagnosing other problems. I could also write to both locations, using the cluster copy for work and the GCS copy for diagnostics.
But assuming this is some kind of sync issue, is there a way to force GCS to flush writes / be fully synced between jobs? Or is there some check I can do to make sure everything has been written before starting the next job in my chain?
EDIT: To answer the comment below: the sequence of jobs all run on the same cluster. The cluster is started, each job is run in turn on that cluster, and then the cluster is shut down.
For now, I have worked around this by having the jobs write to HDFS on the cluster in addition to GCS, with the subsequent jobs reading from the cluster. The GCS output is now strictly for diagnostics in case of a problem. But even though my immediate problem is (I believe) fixed, I would still like to know what's happening and why GCS seems out of sync for a bit.
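One check that can answer the last question: the standard Hadoop/Spark output committers drop a _SUCCESS marker into the output directory only after all part files have been committed, so the next job can wait for that marker before reading. A minimal sketch, assuming the default committer is in use; the bucket path and timeout are placeholders:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def waitForSuccessMarker(outputDir: String, timeoutMs: Long = 600000L): Unit = {
  val marker   = new Path(outputDir, "_SUCCESS")
  val fs       = marker.getFileSystem(new Configuration())
  val deadline = System.currentTimeMillis() + timeoutMs
  // Poll until the committer has published the marker, or give up.
  while (!fs.exists(marker)) {
    if (System.currentTimeMillis() > deadline)
      throw new IllegalStateException(s"No _SUCCESS marker found in $outputDir")
    Thread.sleep(5000)
  }
}

// Before the next job in the chain reads the Avro output:
// waitForSuccessMarker("gs://my-bucket/work/avro-step-1")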
A job has been submitted and there is an entry for it in DBA_JOBS, but the job never reaches the running state, so there is no entry for it in DBA_JOBS_RUNNING. The parameter 'JOB_QUEUE_PROCESS' has the value 10,
and there are no jobs in the running state. Please suggest how to solve this problem.
SELECT NEXT_DATE, NEXT_SEC, BROKEN, FAILURES, WHAT
FROM DBA_JOBS
WHERE JOB = :JOB_ID
What does that return? A BROKEN job won't kick off, and if NEXT_DATE/NEXT_SEC is in the future, it won't kick off until then either.
I hope you labeled that database parameter correctly, i.e. JOB_QUEUE_PROCESSES=10.
These checks cover the typical reasons a job won't run.
Also check that the user/schema running the job is correct.
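If the query shows the job is broken (or its next date has drifted into the future), you can clear the flag and force an immediate run. A sketch using DBMS_JOB, which DBA_JOBS implies; run it as the job owner:

BEGIN
  DBMS_JOB.BROKEN(:JOB_ID, FALSE);  -- clear the BROKEN flag
  COMMIT;
  DBMS_JOB.RUN(:JOB_ID);            -- run the job right now, in this session
END;
/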
An alternative is to use a different scheduling tool to run the job (e.g. cron on Linux).