I am submitting multiple Spark jobs in the following manner:
someCollection.foreach(m => {
..some code
sparkSubmitClass.run(m.name)
.. some code
})
where the sparkSubmitClass.run() method basically invokes a shell script that calls $SPARK_HOME/bin/spark-submit with the related parameters.
The problem is that this code submits all the Spark jobs in one go. What I want to achieve is: submit a job, then submit the next job only when the earlier one finishes. This is because someCollection is ordered and the next job depends on data created by the previous job(s).
sparkSubmitClass.run() looks like this:
def run(appName: String)(implicit executionContext: ExecutionContext) = {
  val command = s"sparkJob.sh $appName"
  val processBuilder = Process(command)
  val pio = new ProcessIO(_ => (),
    stdout => {
      scala.io.Source.fromInputStream(stdout)
        .getLines.foreach(str => log.info(s"spark-submit: Application Name=$appName stdout='${str.replace("'", "\\'")}'"))
    },
    stderr => {
      val lines = scala.io.Source.fromInputStream(stderr).getLines().toBuffer
      lines.foreach(str => log.info(s"spark-submit: Application Name=$appName stderr='${str.replace("'", "\\'")}'"))
      lines.flatMap(parseLineForApplicationUrl).headOption.foreach(appId => appId)
    })
  val process = processBuilder.run(pio)
  val exitVal = process.exitValue() // returns 0 as soon as the application is submitted
}
And sparkJob.sh is basically:
MAIN_CLASS="com.SomeClassHavingRDDAndHiveOperations"
APPNAME=$1
JAVA_OPTS="-XX:MaxDirectMemorySize=$WORKER_DIRECT_MEM_SIZE -XX:+HeapDumpOnOutOfMemoryError -Djava.net.preferIPv4Stack=true"
SPARK_HOME="/usr/lib/spark"
cmd='$SPARK_HOME/bin/spark-submit --class $MAIN_CLASS
--name ${APPNAME}
--conf "spark.yarn.submit.waitAppCompletion=false"
--conf "spark.io.compression.codec=snappy"
--conf "spark.kryo.unsafe=true"
--conf "spark.kryoserializer.buffer.max=1024m"
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"
--driver-java-options "-XX:MaxMetaspaceSize=$WORKER_PERM_SIZE $JAVA_OPTS"
$appdir/SomeJar.jar $APPNAME'
eval $cmd
Any thoughts on how to build this kind of ordering?
Instead of writing bash scripts and calling each job separately (wasting IO in the read/write phase), why don't you loop over the jobs inside your code, in the order you need? Here are some hints for you to follow:
First, make sure you have an interface and implement it in every class you want to process in order, so that you have a common method to start each job (in this example the method is process and the interface is JobInterface); a sketch follows below.
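A minimal sketch of what that common interface (a trait in Scala) and one job class could look like; the no-argument process signature here is just an assumption:
trait JobInterface {
  def process(): Unit // add whatever arguments your jobs need
}

// example job, e.g. package1.Class1 from the orderedJobs file; it needs a
// no-arg constructor so it can be created via Class.forName(...).newInstance()
class Class1 extends JobInterface {
  override def process(): Unit = {
    // ... the Spark/Hive work for this job ...
  }
}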
Then write all the class-names-with-package in one file, in the order you want. Let's say that file is orderedJobs (you don't need to mention an extension):
package1.Class1
package1.Class2
package2.Class3
....
Read and parse that file. I am assuming it is in the resources folder; you can filter out the lines you don't want:
val classCall = Source.fromInputStream(getClass.getResourceAsStream(<locationOforderedJobs>)).getLines().filter(!_.startsWith("#"))
Loop over each class with foreach and call the common method (process):
classCall.foreach { job =>
  processJob(job).process(<you can pass arguments>)
}
processJob is a function where you instantiate each class by its fully qualified name (as read from orderedJobs):
def processJob(name: String): JobInterface = {
  // name already includes the package, e.g. package1.Class1
  val action = Class.forName(name).newInstance()
  action.asInstanceOf[JobInterface]
}
This way you can reduce the IO/read-write overhead, increase Spark's efficiency by keeping data that later jobs need in memory, reduce the overall processing time, and more.
I hope it helps
Suppose I have a method to be executed once on every worker node.
The following code is what I have come up with to achieve this goal, but it seems that the method is executed twice on the same worker node (there are a master and two worker nodes altogether).
val noOfExecs = sparkSession.sparkContext.getExecutorMemoryStatus.keys.size
val results = sparkSession.sparkContext
.parallelize(0 until noOfExecs, noOfExecs)
.map { _ =>
new SomeClass().doSomething()
}
.cache()
results.count()
How can I make sure that the method is executed only once on every worker node?
Maybe you've confused yourself in drawing the conclusion. Why do you say the method is executed twice on the same worker node?
A few things need to be clarified about Spark:
noOfExecs, obtained via sparkSession.sparkContext.getExecutorMemoryStatus.keys.size, returns the total number of executors plus the driver, which will be 3 if you have two workers/executors (see the sketch after this list).
Breaking your code into chunks: first there is a data set to be parallelized out to the Spark cluster; it is basically a range of integers (0, 1, 2). Note that you cannot really control which integer gets sent to which worker.
Then you map over those integers, so there are 3 values in the data set across the 2 workers, and you ask the workers to do something. (Below I've modified it to print to the worker's stdout, so when you check the worker log you know which data was processed on that worker.)
The rest (cache and results.count()) is just noise.
The method will be executed once for each element of the data set when you use map, so you do not have to ensure this yourself; Spark does.
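As noted in the first point above, a hedged way to count only the workers is to subtract the driver from that figure (assuming, as here, that the driver's block manager is included in getExecutorMemoryStatus):
// the driver's block manager is reported too, so subtract 1 to count only executors/workers
val execOnlyCount = spark.sparkContext.getExecutorMemoryStatus.keys.size - 1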
See below; you should be able to check the workers' logs. In my test, one worker's log has this:
on worker, this method is executed for data number: 1, happy sharing
and the other worker has this:
on worker, this method is executed for data number: 0, happy sharing
on worker, this method is executed for data number: 2, happy sharing
Below is your code, modified:
class SomeClass()
{
def doSomething(x:Int) = {
println(s"on worker, this method is executed for data number: $x, happy sharing")
}
}
// below returns 3 for a cluster with 1 driver and 2 executors/workers
val driverAndWorkers = spark.sparkContext.getExecutorMemoryStatus
val noOfExecs = driverAndWorkers.keys.size
// below is basically 0, 1, 2
val data = 0 until noOfExecs
val rddOfInt = spark.sparkContext.parallelize(data, noOfExecs) // noOfExecs can be removed; for this case it does not matter how you partition the RDD
val results = rddOfInt
.map { x =>
new SomeClass().doSomething(x)
}
.cache()
results.count()
Below is my code:
class Data(val x:Double=0.0,val y:Double=0.0) {
var cluster = 0;
}
var dataList = new ArrayBuffer[Data]()
val data = sc.textFile("Path").map(line => line.split(",")).map(userRecord => (userRecord(3), userRecord(4)))
data.foreach(a => dataList += new Data(a._1.toDouble, a._2.toDouble))
When I do
dataList.size
I get the output 0.
But there are more than 4k records in data.
Now when I try using take
data.take(10).foreach(a => dataList += new Data(a._1.toDouble, a._2.toDouble))
Now I get data in dataList, but I want my whole data set in dataList.
Please help.
The problem is that your code inside the foreach runs on the distributed workers, not in the same main thread where you inspect dataList.size. Use RDD.collect() to bring the data back to the driver:
val dataList = data
  .map(a => new Data(a._1.toDouble, a._2.toDouble))
  .collect()
The problem is related to where your code is executed. Every operation made inside a transformation (map, flatMap, reduce, and so on) is not performed in the main thread (or in the driver node), but on the worker nodes. These nodes run in different threads (or hosts) than the driver node.
Every object that is not stored inside an RDD and that is used on a worker node lives only in that worker's memory space. So your dataList object is simply freshly created in each worker, and the driver node cannot retrieve any information from these remote objects.
The code in the main program and in the so-called actions (foreach, collect, take and so on) is executed in the main thread or driver node. So, when you run
data.take(10).foreach(a => dataList += new Data(a._1.toDouble, a._2.toDouble))
the take method brings the first 10 elements of the RDD back from the workers. All the code is then executed in the driver node and the magic works.
If you want to build an RDD of Data objects, you have to apply the transformation directly to the original RDD. Try something similar to the following:
val dataList: RDD[Data] =
data.map(a => new Data(a._1.toDouble, a._2.toDouble))
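If you also need the data as a local collection on the driver (like the original dataList), one option, assuming it fits in driver memory, is to collect the mapped RDD:
// brings all Data objects back to the driver as a local Array
val localData: Array[Data] = dataList.collect()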
Have a look also at this post: A new way to err, Apache Spark.
Hope it helps.
I am running a Spark application (version 1.6.0) on a Hadoop cluster with Yarn (version 2.6.0) in client mode. I have a piece of code that runs a long computation, and I want to kill it if it takes too long (and then run some other function instead).
Here is an example:
import java.util.concurrent.TimeUnit

import scala.concurrent.Await
import scala.concurrent.duration.Duration

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("TIMEOUT_TEST")
val sc = new SparkContext(conf)
val lst = List(1,2,3)
// setting up an infinite action
val future = sc.parallelize(lst).map { i => while (true) {}; i }.collectAsync()
try {
Await.result(future, Duration(30, TimeUnit.SECONDS))
println("success!")
} catch {
case _:Throwable =>
future.cancel()
println("timeout")
}
// sleep for 1 hour to allow inspecting the application in yarn
Thread.sleep(60*60*1000)
sc.stop()
The timeout is set to 30 seconds, but of course the computation is infinite, so awaiting the result of the future will throw an exception, which is caught; the future is then cancelled and the backup function executes.
This all works perfectly well, except that the canceled job doesn't terminate completely: when looking at the web UI for the application, the job is marked as failed, but I can see there are still running tasks inside.
The same thing happens when I use SparkContext.cancelAllJobs or SparkContext.cancelJobGroup. The problem is that even though I manage to get on with my program, the running tasks of the canceled job are still hogging valuable resources (which will eventually slow me down to a near stop).
To sum things up: How do I kill a Spark job in a way that will also terminate all running tasks of that job? (as opposed to what happens now, which is stopping the job from running new tasks, but letting the currently running tasks finish)
UPDATE:
After a long time ignoring this problem, we found a messy but efficient little workaround. Instead of trying to kill the appropriate Spark Job/Stage from within the Spark application, we simply logged the stage ID of all active stages when the timeout occurred, and issued an HTTP GET request to the URL presented by the Spark Web UI used for killing said stages.
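For reference, here is a rough sketch of that workaround; the UI host/port and the exact kill URL are assumptions based on the kill link of the 1.6 web UI (spark.ui.killEnabled must be true), so verify them against your setup:
import scala.io.Source

// kill every currently active stage by requesting the same URL the web UI's kill link uses
def killActiveStages(sc: org.apache.spark.SparkContext,
                     uiBase: String = "http://localhost:4040"): Unit = {
  sc.statusTracker.getActiveStageIds().foreach { stageId =>
    val killUrl = s"$uiBase/stages/stage/kill/?id=$stageId&terminate=true"
    Source.fromURL(killUrl).mkString // issue the GET and ignore the response body
  }
}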
I don't know if this answers your question.
My need was to kill jobs hanging for too long (my jobs extract data from Oracle tables, but for some unknown reason the connection occasionally hangs forever).
After some study, I came to this solution:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

import org.apache.spark.JobExecutionStatus
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

val MAX_JOB_SECONDS = 100
val statusTracker = sc.statusTracker;
val sparkListener = new SparkListener()
{
override def onJobStart(jobStart : SparkListenerJobStart)
{
val jobId = jobStart.jobId
val f = Future
{
var c = MAX_JOB_SECONDS;
var mustCancel = false;
var running = true;
while(!mustCancel && running)
{
Thread.sleep(1000);
c = c - 1;
mustCancel = c <= 0;
// getJobInfo returns an Option, so match on it instead of checking for null
statusTracker.getJobInfo(jobId) match {
case Some(info) => running = info.status() == JobExecutionStatus.RUNNING
case None => running = false
}
}
if(mustCancel)
{
sc.cancelJob(jobId)
}
}
}
}
sc.addSparkListener(sparkListener)
try
{
val df = spark.sql("SELECT * FROM VERY_BIG_TABLE") //just an example of long-running-job
println(df.count)
}
catch
{
case exc: org.apache.spark.SparkException =>
{
if(exc.getMessage.contains("cancelled"))
throw new Exception("Job forcibly cancelled")
else
throw exc
}
case ex : Throwable =>
{
println(s"Another exception: $ex")
}
}
finally
{
sc.removeSparkListener(sparkListener)
}
For the sake of future visitors: since 2.0.3, Spark has had a built-in task reaper, which addresses this scenario (more or less).
Note that it can eventually kill an executor if the task is not responsive.
Moreover, some of Spark's built-in data sources have been refactored to be more responsive to task cancellation.
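A hedged sketch of enabling the task reaper; the timeout value here is only an example, and the full set of spark.task.reaper.* options is documented in the Spark configuration guide:
import org.apache.spark.SparkConf

// monitor killed-but-still-running tasks and, if one keeps ignoring the interrupt,
// eventually kill its executor
val conf = new SparkConf()
  .set("spark.task.reaper.enabled", "true")
  .set("spark.task.reaper.killTimeout", "120s")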
For the 1.6.0 version, Zohar's solution is a "messy but efficient" one.
According to setJobGroup:
"If interruptOnCancel is set to true for the job group, then job cancellation will result in Thread.interrupt() being called on the job's executor threads."
So the anonymous function in your map must be interruptible, like this:
val future = sc.parallelize(lst).map { i => while (!Thread.interrupted()) {}; i }.collectAsync()
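A minimal sketch of tying this together with a job group (the group name and description are illustrative):
// interruptOnCancel = true means cancelling the group calls Thread.interrupt()
// on the executor threads running its tasks
sc.setJobGroup("timeout-test", "cancellable long computation", interruptOnCancel = true)
val future = sc.parallelize(lst).map { i => while (!Thread.interrupted()) {}; i }.collectAsync()
// ... later, on timeout:
sc.cancelJobGroup("timeout-test")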
I've come across the following code in a Gatling scenario (modified for brevity/privacy):
val scn = scenario("X")
.repeat(numberOfLoops, "loopName") {
exec((session : Session) => {
val loopCounter = session.getTypedAttribute[Int]("loopName")
session.setAttribute("xmlInput", createXml(loopCounter))
})
.exec(
http("X")
.post("/rest/url")
.headers(headers)
.body("${xmlInput}"))
}
It's naming the loop in the repeat block, getting that out of the session and using it to create a unique input XML. It then sticks that XML back into the session and extracts it again when posting it.
I would like to do away with the need to name the loop iterator and to access the session.
Ideally I'd like to use a Stream to generate the XML.
But Gatling controls the looping and I can't recurse. Do I need to compromise, or can I use Gatling in a functional way (without vars or accessing the session)?
As I see it, neither numberOfLoops nor createXml seems to depend on anything user-related that would have been stored in the session, so the loop could be resolved at build time, not at runtime.
import com.excilys.ebi.gatling.core.structure.ChainBuilder
def addXmlPost(chain: ChainBuilder, i: Int) =
chain.exec(
http("X")
.post("/rest/url")
.headers(headers)
.body(createXml(i))
)
def addXmlPostLoop(chain: ChainBuilder): ChainBuilder =
(0 until numberOfLoops).foldLeft(chain)(addXmlPost)
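For illustration, the foldLeft simply unrolls the loop at build time; with numberOfLoops = 3 the resulting chain is equivalent to the following (same names as above, illustrative only):
chain
  .exec(http("X").post("/rest/url").headers(headers).body(createXml(0)))
  .exec(http("X").post("/rest/url").headers(headers).body(createXml(1)))
  .exec(http("X").post("/rest/url").headers(headers).body(createXml(2)))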
Cheers,
Stéphane
PS: The preferred way to ask something about Gatling is our Google Group: https://groups.google.com/forum/#!forum/gatling
I'm trying to get a clean and graceful shutdown, and for some reason, it won't execute. I've tried:
sys addShutdownHook{
logger.warn("SHUTTING DOWN...")
// irrelevant logic here...
}
and also:
Runtime.getRuntime.addShutdownHook(ThreadOperations.delayOnThread{
logger.warn("SHUTTING DOWN...")
// irrelevant logic here...
}
)
where ThreadOperations.delayOnThread definition is:
object ThreadOperations {
def startOnThread(body: =>Unit) : Thread = {
onThread(true, body)
}
def delayOnThread(body: =>Unit) : Thread = {
onThread(false, body)
}
private def onThread(runNow : Boolean, body: =>Unit) : Thread = {
val t=new Thread {
override def run=body
}
if(runNow){t.start}
t
}
// more irrelevant operations...
}
But when I run my program (executable jar, double activation), the hook does not run. So what am I doing wrong? What is the right way to add a shutdown hook in Scala? Is it in any way related to the fact that I'm using double activation?
Double activation is done like this:
object Gate extends App {
val givenArgs = if(args.isEmpty){
Array("run")
}else{
args
}
val jar = Main.getClass.getProtectionDomain().getCodeSource().getLocation().getFile;
val dir = jar.dropRight(jar.split(System.getProperty("file.separator")).last.length + 1)
val arguments = Seq("java", "-cp", jar, "boot.Main") ++ givenArgs.toSeq
Process(arguments, new java.io.File(dir)).run();
}
(Scala version: 2.9.2)
Thanks.
In your second attempt, your shutdown hook seems to just create a thread and never start it (so it just gets garbage collected and does nothing). Did I miss something? (EDIT: yes I did, see the comments. My bad.)
In the first attempt, the problem might just be that the underlying log has some caching, and the application exits before the log is flushed.
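One way to rule out log buffering is to write to stderr directly in the hook; a minimal sketch:
// Console.err bypasses logger buffering, so the message shows up even during JVM shutdown
sys.addShutdownHook {
  Console.err.println("SHUTTING DOWN...")
  // irrelevant logic here...
}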
Solved it.
For some reason, I thought that run, as opposed to !, would detach the process. It actually hangs around because there are open streams left to the Process returned from run (or maybe it hangs for another reason, since exec doesn't hang yet also returns a Process with open streams to and from the child process, much like run). For this reason, the original process was still alive, and I accidentally sent the signals to it. Of course, it did not contain a handler or a shutdown hook, so nothing happened.
The solution was to use Runtime.getRuntime.exec(arguments.toArray) instead of Process(arguments, new java.io.File(dir)).run(), close the streams in the Gate object, and send the ^C signal to the right process.
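For completeness, a rough sketch of the adjusted Gate object under that approach (stream handling kept minimal; everything else is unchanged from the original):
object Gate extends App {
  val givenArgs = if (args.isEmpty) Array("run") else args

  val jar = Main.getClass.getProtectionDomain().getCodeSource().getLocation().getFile
  val dir = jar.dropRight(jar.split(System.getProperty("file.separator")).last.length + 1)
  val arguments = Seq("java", "-cp", jar, "boot.Main") ++ givenArgs.toSeq

  // Runtime.exec instead of Process(...).run(); closing the child's streams means
  // nothing keeps this launcher attached, so ^C now reaches the right process
  val child = Runtime.getRuntime.exec(arguments.toArray, null, new java.io.File(dir))
  child.getOutputStream.close()
  child.getInputStream.close()
  child.getErrorStream.close()
}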