Spark - Databricks jobs occasionally fail due to slow cluster library import. The notebook execution starts before the library is fully initialized and the entire process fails. If the same notebook is executed shortly after the failure, the import proceeds successfully without any other intervention needed.
I tried a try/catch approach, but the imported objects are, unsurprisingly, only available within the try block:
var tries = 0
var importSuccessful = false
while (!importSuccessful && tries < 3) {
  try {
    import org.fde.dataFoundation.helper
    import org.fde.dataFoundation.transactionMapping._
    importSuccessful = true
  } catch {
    case e: Exception =>
      println("Import failed with error: " + e.getMessage)
      tries += 1
      if (tries >= 3) {
        throw new Exception(s"FD&E Data Foundation library couldn't be imported after $tries attempts. Check if library ocrDataFoundationS3 is installed on the cluster")
      }
      Thread.sleep(5 * 60 * 1000) // wait 5 minutes
  }
}
Could you please recommend a way to run this retry loop while still making the library available to the entire notebook?
Note: the internal process rules require use of libraries installed on VC.
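One possible pattern (a sketch only, not a tested solution): keep the retry loop, but probe for the library with Class.forName instead of an import statement, and run the probe in a cell that executes before the cell with the real imports. The class name org.fde.dataFoundation.helper.Helper below is a placeholder; substitute any class you know the installed library provides.

// Cell 1: wait until the library is actually available on the cluster classpath.
// "org.fde.dataFoundation.helper.Helper" is a hypothetical class name.
var tries = 0
var libraryAvailable = false
while (!libraryAvailable && tries < 3) {
  try {
    Class.forName("org.fde.dataFoundation.helper.Helper")
    libraryAvailable = true
  } catch {
    case _: ClassNotFoundException =>
      tries += 1
      if (tries >= 3) {
        throw new Exception(s"FD&E Data Foundation library couldn't be loaded after $tries attempts. " +
          "Check if library ocrDataFoundationS3 is installed on the cluster")
      }
      Thread.sleep(5 * 60 * 1000) // wait 5 minutes before retrying
  }
}

// Cell 2: by the time this cell is compiled and run, the classes exist,
// so plain top-level imports are visible to the rest of the notebook.
import org.fde.dataFoundation.helper
import org.fde.dataFoundation.transactionMapping._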
Related
I am running a Spark application (version 1.6.0) on a Hadoop cluster with Yarn (version 2.6.0) in client mode. I have a piece of code that runs a long computation, and I want to kill it if it takes too long (and then run some other function instead).
Here is an example:
import java.util.concurrent.TimeUnit
import scala.concurrent.Await
import scala.concurrent.duration.Duration
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("TIMEOUT_TEST")
val sc = new SparkContext(conf)
val lst = List(1, 2, 3)
// setting up an infinite action
val future = sc.parallelize(lst).map { x => while (true) {}; x }.collectAsync()
try {
  Await.result(future, Duration(30, TimeUnit.SECONDS))
  println("success!")
} catch {
  case _: Throwable =>
    future.cancel()
    println("timeout")
}
// sleep for 1 hour to allow inspecting the application in yarn
Thread.sleep(60 * 60 * 1000)
sc.stop()
The timeout is set to 30 seconds, but of course the computation is infinite, so awaiting the result of the future will throw an Exception, which will be caught; the future will then be cancelled and the backup function will execute.
This all works perfectly well, except that the canceled job doesn't terminate completely: when looking at the web UI for the application, the job is marked as failed, but I can see there are still running tasks inside.
The same thing happens when I use SparkContext.cancelAllJobs or SparkContext.cancelJobGroup. The problem is that even though I manage to get on with my program, the running tasks of the canceled job are still hogging valuable resources (which will eventually slow me down to a near stop).
To sum things up: How do I kill a Spark job in a way that will also terminate all running tasks of that job? (as opposed to what happens now, which is stopping the job from running new tasks, but letting the currently running tasks finish)
UPDATE:
After a long time ignoring this problem, we found a messy but efficient little workaround. Instead of trying to kill the appropriate Spark Job/Stage from within the Spark application, we simply logged the stage ID of all active stages when the timeout occurred, and issued an HTTP GET request to the URL presented by the Spark Web UI used for killing said stages.
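For illustration, a minimal sketch of that workaround, assuming the default driver UI on port 4040 and spark.ui.killEnabled=true; the kill URL format below is the one exposed by the Spark 1.x web UI and may differ in newer versions:

import scala.io.Source

// Ask the Spark web UI to kill every currently active stage.
// The GET request triggers the kill as a side effect.
def killActiveStages(uiBase: String = "http://localhost:4040"): Unit = {
  sc.statusTracker.getActiveStageIds.foreach { stageId =>
    val killUrl = s"$uiBase/stages/stage/kill/?id=$stageId&terminate=true"
    Source.fromURL(killUrl).mkString
  }
}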
I don't know if this answers your question.
My need was to kill jobs that hang for too long (my jobs extract data from Oracle tables, but for some unknown reason the connection occasionally hangs forever).
After some study, I came to this solution:
import org.apache.spark.JobExecutionStatus
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

val MAX_JOB_SECONDS = 100
val statusTracker = sc.statusTracker

val sparkListener = new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    val jobId = jobStart.jobId
    // Watchdog: poll the job status once per second and cancel the job
    // if it is still running after MAX_JOB_SECONDS.
    Future {
      var remaining = MAX_JOB_SECONDS
      var mustCancel = false
      var running = true
      while (!mustCancel && running) {
        Thread.sleep(1000)
        remaining -= 1
        mustCancel = remaining <= 0
        statusTracker.getJobInfo(jobId) match {
          case Some(jobInfo) => running = jobInfo.status() == JobExecutionStatus.RUNNING
          case None          => running = false
        }
      }
      if (mustCancel) {
        sc.cancelJob(jobId)
      }
    }
  }
}

sc.addSparkListener(sparkListener)
try {
  val df = spark.sql("SELECT * FROM VERY_BIG_TABLE") // just an example of a long-running job
  println(df.count)
} catch {
  case exc: org.apache.spark.SparkException =>
    if (exc.getMessage.contains("cancelled"))
      throw new Exception("Job forcibly cancelled")
    else
      throw exc
  case ex: Throwable =>
    println(s"Another exception: $ex")
} finally {
  sc.removeSparkListener(sparkListener)
}
For the sake of future visitors: since version 2.0.3, Spark has shipped the task reaper, which addresses this scenario (more or less) as a built-in solution.
Note that it can eventually kill an executor if the task is not responsive.
Moreover, some built-in Spark data sources have been refactored to be more responsive to cancellation.
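As a sketch, enabling the reaper amounts to setting a few configuration flags; the interval and timeout values below are illustrative, not recommendations:

import org.apache.spark.SparkConf

// Spark 2.0.3+: monitor killed/interrupted tasks and escalate if they don't stop.
val conf = new SparkConf()
  .set("spark.task.reaper.enabled", "true")         // turn the task reaper on
  .set("spark.task.reaper.pollingInterval", "10s")  // how often executors poll killed tasks
  .set("spark.task.reaper.killTimeout", "120s")     // kill the executor JVM if a task won't die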
For the 1.6.0 version, Zohar's solution is a "messy but efficient" one.
According to setJobGroup:
"If interruptOnCancel is set to true for the job group, then job cancellation will result in Thread.interrupt() being called on the job's executor threads."
So the anonymous function in your map must be interruptible, like this:
val future = sc.parallelize(lst).map { x => while (!Thread.interrupted()) {}; x }.collectAsync()
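Putting the quote and the interruptible function together, a sketch might look like this (the job-group name and the 30-second timeout are illustrative):

import java.util.concurrent.{TimeUnit, TimeoutException}
import scala.concurrent.Await
import scala.concurrent.duration.Duration

// Run the action inside a job group created with interruptOnCancel = true,
// so cancelling the group calls Thread.interrupt() on the executor threads.
sc.setJobGroup("timeout-test", "long computation with a 30s timeout", interruptOnCancel = true)
val future = sc.parallelize(lst).map { x => while (!Thread.interrupted()) {}; x }.collectAsync()
try {
  Await.result(future, Duration(30, TimeUnit.SECONDS))
  println("success!")
} catch {
  case _: TimeoutException =>
    sc.cancelJobGroup("timeout-test") // tasks now see the interrupt and stop
    println("timeout")
}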
I have a custom AMI which runs my service. Using the AWS Java SDK, I create an EC2 instance using RunInstancesRequest from the AMI. Now before I begin to use my service, I must ensure that the newly created instance is up and running. I poll the instance using:
var transitionCompleted = false
while (!transitionCompleted) {
  val currentState = instance.getState.getName
  if (currentState == desiredState) {
    transitionCompleted = true
  }
  if (!transitionCompleted) {
    try {
      Thread.sleep(TRANSITION_INTERVAL)
    } catch {
      case e: InterruptedException => e.printStackTrace()
    }
  }
}
So when the currentState of the instance turns into desiredState (which is running), I take it that the instance is ready. However, any newly created instance, despite being in the running state, is not available for immediate use, as it is still initializing.
How do I ensure that I return only when I'm able to access the instance and its services? Are there any specific status checks to make?
PS: I use Scala
You are checking the instance state, while what you are actually interested in are the instance status checks. You could use the describeInstanceStatus method from the Amazon Java SDK, but instead of implementing your own polling (in non-idiomatic Scala) it's better to use a ready-made solution from the SDK: the EC2 waiters.
import com.amazonaws.services.ec2._, model._, waiters._

val ec2client: AmazonEC2 = ...

val request = new DescribeInstanceStatusRequest().withInstanceIds(instanceID)

ec2client.waiters.instanceStatusOk.run(
  new WaiterParameters()
    .withRequest(request)
    // Optionally, you can tune the PollingStrategy:
    // .withPollingStrategy(...)
)
To customize polling delay and retry strategies of the waiter check the PollingStrategy documentation.
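For example, a custom polling strategy could look like this sketch (the classes are in com.amazonaws.waiters; the attempt count and delay are illustrative):

import com.amazonaws.waiters.{FixedDelayStrategy, MaxAttemptsRetryStrategy, PollingStrategy}

ec2client.waiters.instanceStatusOk.run(
  new WaiterParameters()
    .withRequest(request)
    .withPollingStrategy(new PollingStrategy(
      new MaxAttemptsRetryStrategy(40),  // give up after 40 polls
      new FixedDelayStrategy(15)         // wait 15 seconds between polls
    ))
)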
Here is how my test suite is configured.

"test payments" should {
  "Add 100 credits" in {
    runTeamTest { team =>
      withRunningKafka {
        val addCreditsRequest = AddCreditsRequest(team.id.stringify, member1Email, 100)
        TestCommon.makeRequestAndCheck(
          member1Email,
          TeamApiGenerated.addCredits().url,
          Helpers.POST,
          Json.toJson(addCreditsRequest),
          OK
        )
        val foundTeam = TestCommon.waitForFuture(TeamDao.findOneById(team.id))
        foundTeam.get.credits mustEqual initialCreditAmount + 100
      }
    }
  }

  "deduct 100 credits" in {
    runTeamTest { team =>
      withRunningKafka {
        val deductCreditsRequest = DeductCreditsRequest(team.id.stringify, member1Email, 100)
        TestCommon.makeRequestAndCheck(
          member1Email,
          TeamApiGenerated.deductCredits().url,
          Helpers.POST,
          Json.toJson(deductCreditsRequest),
          OK
        )
        val foundTeam = TestCommon.waitForFuture(TeamDao.findOneById(team.id))
        foundTeam.get.credits mustEqual initialCreditAmount - 100
      }
    }
  }
}
Within ScalaTest, the overarching suite name is "test payments" and the subsequent tests inside it have issues after the first one is run. If I run each of the two tests individually, they succeed, but if I run the entire suite, the first succeeds and the second returns an org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition. exception. The code above doesn't show the controllers being tested, but within the controller I have a Kafka consumer that is constantly polling, and close() isn't called on it within the tests.
I'd suggest you use companion object methods EmbeddedKafka.start() and EmbeddedKafka.stop() in the beforeAll and afterAll sections. This way you also avoid stopping / starting Kafka again for a single test class.
Also try to make sure you're not trying to start 2 or more instances of Kafka on the same port at the same time.
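A sketch of that setup, assuming the net.manub.embeddedkafka artifact and ScalaTest's BeforeAndAfterAll (package and config names may differ with your library version):

import net.manub.embeddedkafka.{EmbeddedKafka, EmbeddedKafkaConfig}
import org.scalatest.BeforeAndAfterAll

trait KafkaSuite extends BeforeAndAfterAll { this: org.scalatest.Suite =>

  implicit val kafkaConfig: EmbeddedKafkaConfig = EmbeddedKafkaConfig()

  // Start one embedded Kafka broker for the whole suite instead of
  // wrapping every single test in withRunningKafka { ... }.
  override def beforeAll(): Unit = {
    super.beforeAll()
    EmbeddedKafka.start()
  }

  override def afterAll(): Unit = {
    EmbeddedKafka.stop()
    super.afterAll()
  }
}

The payment spec can then mix in KafkaSuite and drop the per-test withRunningKafka blocks, which also avoids two brokers competing for the same port.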
If I started a process using Scala Process/ProcessBuilder. How can I get the pid of the process that was created?
I could not find any mention of the pid in the official docs:
http://www.scala-lang.org/api/2.10.4/index.html#scala.sys.process.Process
http://www.scala-lang.org/api/2.10.4/index.html#scala.sys.process.ProcessBuilder
http://www.scala-lang.org/api/2.10.4/index.html#scala.sys.process.package
2016: same question; I've been clicking through related questions for a few minutes, but still couldn't find any solution that is generally agreed upon. Here is a Scala version inspired by LRBH10's Java code in the answer linked by wingedsubmariner:
import scala.sys.process.Process

def pid(p: Process): Long = {
  val procField = p.getClass.getDeclaredField("p")
  procField.synchronized {
    procField.setAccessible(true)
    val proc = procField.get(p)
    try {
      proc match {
        case unixProc if unixProc.getClass.getName == "java.lang.UNIXProcess" =>
          val pidField = unixProc.getClass.getDeclaredField("pid")
          pidField.synchronized {
            pidField.setAccessible(true)
            try {
              pidField.getLong(unixProc)
            } finally {
              pidField.setAccessible(false)
            }
          }
        // If someone wants to add support for Windows processes,
        // this would be the right place to do it:
        case _ => throw new RuntimeException(
          "Cannot get PID of a " + proc.getClass.getName)
      }
    } finally {
      procField.setAccessible(false)
    }
  }
}
// little demo
val proc = Process("echo 'blah blah blaaah'").run()
println(pid(proc))
WARNING: the scala code runner is essentially just a bash script, so when you use it to launch Scala programs it does a thousand things before actually starting the java process. Therefore, the PID of the java process that you are actually interested in will be much larger than what the above code snippet returns. So this method is essentially useless if you start your processes with scala. Use java directly, and explicitly add the Scala library to the classpath.
The scala.sys.process classes are wrappers around the Java classes for starting processes, and unfortunately it is difficult to obtain the PID from this API. See this Stack Overflow question: How to get PID of process I've just started within java program?
In Play, if an exception is thrown while initializing a class - for example, when configuration.get("badKey").get is called - I believe a java.lang.ExceptionInInitializerError is raised. However, no error is ever caught or logged by Play and it just goes on, broken.
How can I make sure this is caught, or logged, or something other than just ignored?
It brings the website to a halt, but I don't know that it's happened.
I know that I shouldn't have these errors in the first place. What first prompted this is that the application.conf file differs depending on whether it's my local environment or production. If a developer forgets to add a key to the production one, it can bring the website to a halt (silently). Conceivably, though, there could be other errors too.
You can use Play's built-in Logger.

import play.Logger

try {
  configuration.get("badKey").get
} catch {
  case e: ExceptionInInitializerError =>
    Logger.error("ExceptionInInitializerError")
    Logger.error(e.getMessage())
  case e: Exception =>
    Logger.error("Exception")
    Logger.error(e.getMessage())
}
Or write an accessor that falls back to a default or logs an error, without needing a try/catch:
import play.Logger
import play.api.Play

object Conf {
  private val config = Play.current.configuration

  def get(key: String): String =
    config.getString(key) match {
      case Some(value: String) => value
      case _ =>
        Logger.warn("No value found in application.conf for [" + key + "]")
        "some default value"
    }
}
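A hypothetical call site (the key name is made up) would then replace the bare configuration lookups:

// A missing key now logs a warning and falls back to the default
// instead of failing class initialization.
val apiKey: String = Conf.get("external.api.key")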