Killing the spark.sql - scala

I am new to both Scala and Spark.
I have Scala code that executes queries in a while loop, one after the other.
What we need is: if a particular query takes more than a certain time, say 10 minutes, we should be able to stop that query's execution and move on to the next one.
For example:
import java.util.concurrent.TimeUnit
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

do {
  val f = Future {
    spark.sql("some query")
  }
  f onSuccess {
    case _ => println("Query ran in 10 mins")
  }
  f onFailure {
    case _ => println("Query took more than 10 mins")
  }
  val result = Await.ready(f, Duration(10, TimeUnit.MINUTES))
} while (someCondition)
I understand that when we call spark.sql, control is handed to Spark, and I need to kill/stop that job when the duration is over so that I can get the resources back.
I have tried multiple things but I am not sure how to solve this.
Any help would be welcome, as I am stuck with this.
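One approach not shown in the question (a sketch, not a verified solution): run each query inside its own Spark job group and cancel that group if the Future times out. setJobGroup and cancelJobGroup are standard SparkContext methods; the collect() call below just stands in for whatever action the query feeds, and spark is assumed to be a SparkSession:
import java.util.concurrent.{TimeUnit, TimeoutException}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// Run the query in its own job group so it can be cancelled from the driver.
val groupId = "timed-query-" + System.currentTimeMillis()
val f = Future {
  // setJobGroup must be called on the thread that submits the job, i.e. inside the Future body.
  spark.sparkContext.setJobGroup(groupId, "timed query", interruptOnCancel = true)
  spark.sql("some query").collect() // the action is what actually runs the Spark job
}
try {
  Await.result(f, Duration(10, TimeUnit.MINUTES))
  println("Query ran in 10 mins")
} catch {
  case _: TimeoutException =>
    // Ask Spark to kill the running job(s) for this query and reclaim the resources.
    spark.sparkContext.cancelJobGroup(groupId)
    println("Query took more than 10 mins")
}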

Related

Is there an Operation to block onComplete?

I am trying to learn reactive programming, so forgive me if I ask a silly question. I'm also open to advice on changing my design.
I am working in scala-swing to display the results of a simulator. With one setting, the chart is displayed as a histogram; with the other setting it is displayed as the cumulative sum (I'm probably using the wrong word: in the first setting you might have bin1=2, bin2=5, bin3=3; in the second setting the first height is 2, the second is 2 + 5, the third is 2 + 5 + 3, etc.). The simulator can be slow, so I originally used a Future to compute it and then set the data into the chart. I decided to try a reactive approach, so my requirements are: 1. I don't want to recreate the data when I change the display mode, and 2. I want to set the Observable once for the chart and have the chart listen to the same Observable permanently.
I got this to work when I started the chain with a PublishSubject and the Future set the data into the start of the chain. When the display mode changed, I created a new PublishSubject().map(newRenderingLogic).subscribe(theChartsObservable). I am now trying to do what looks like the "right way," but it's not working correctly. I've tried to simplify what I have done:
val textObservable: Subject[String] = PublishSubject()
textObservable.subscribe(text => {
  println(s"Text: ${text}")
})
var textSubscription: Option[Subscription] = None
val start = Observable.from(Future {
  "Base text"
}).cache
var i = 0
val button = new Button() {
  text = "Click"
  reactions += {
    case event => {
      i += 1
      if (textSubscription.isDefined) {
        textSubscription.get.unsubscribe()
      }
      textSubscription = Some(start.map(((j: Int) => { (base: String) => s"${base} ${j}" })(i)).subscribe(textObservable))
    }
  }
}
On start, an Observable is created and logic to print some text is subscribed to it. Then, an Observable with the generated data is created, with cache added so that the result is replayed if the next subscription comes in after the data has been generated. Then, a button is created. On each button click, a middle observable is chained in with unique logic (a function that creates a function appending the current value of i to the string; I tried to make something that couldn't simply be reused), which is supposed to change with each click. Finally, the first Observable is subscribed to it so that the results of the whole chain end up being printed.
In theory, the cache operation takes care of not regenerating the data, and this works once, but onComplete is called on textObservable and then it can't be used again. It works if I subscribe like this:
textSubscription = Some(start.map(((j: Int) => { (base: String) => s"${base} ${j}" })(i)).subscribe(text => textObservable.onNext(text)))
because the call to onComplete is intercepted, but this looks wrong, and I wanted to know if there is a more typical way to do or architect this. It makes me think I don't understand how this is supposed to be done if there isn't an out-of-the-box operation for it.
Thank you.
I'm not 100% sure if I got the essence of your question right, but: if you have an Observable that may complete and you want to turn it into an Observable that never completes, you can just concatenate it with Observable.never.
For example:
// will complete after emitting those three elements:
val completes = Observable.from(List(1, 2, 3))
// will emit those three elements, but will never complete:
val wontComplete = completes ++ Observable.never
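Applied to the question's code, a minimal sketch (reusing the asker's start, textObservable, and i; untested) could be:
// start ++ Observable.never still replays the cached value but never emits
// onComplete, so textObservable keeps working across button clicks.
textSubscription = Some(
  (start ++ Observable.never)
    .map(((j: Int) => (base: String) => s"${base} ${j}")(i))
    .subscribe(textObservable)
)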

Creating Seq after waiting for all results from map/foreach in Scala

I am trying to loop over inputs and process them to produce scores.
Just for the first input, I want to do some processing that takes a while.
The function ends up returning just the values from the 'else' branch; the 'if' branch only finishes executing after the function has already returned.
I am new to Scala and understand the behavior, but I'm not sure how to fix it.
I've tried inputs.zipWithIndex.map instead of foreach but the result is the same.
def getscores(
    inputs: inputs
): Future[Seq[scoreInfo]] = {
  var scores: Seq[scoreInfo] = Seq()
  inputs.zipWithIndex.foreach {
    case (f, i) => {
      if (i == 0) {
        // long operation that returns Future[Option[scoreInfo]]
        getgeoscore(f).foreach(gso => {
          gso.foreach(score => {
            scores = scores.:+(score)
          })
        })
      } else {
        scores = scores.:+(
          scoreInfo(
            id = "",
            score = 5
          )
        )
      }
    }
  }
  Future {
    scores
  }
}
For what you need, I would drop the mutable variable and replace foreach with map to obtain an immutable list of Futures, use recover to handle exceptions, and then combine them with Future.sequence, like below:
def getScores(inputs: Inputs): Future[List[ScoreInfo]] = Future.sequence(
  inputs.zipWithIndex.map { case (input, idx) =>
    if (idx == 0)
      getGeoScore(input).map(_.getOrElse(defaultScore)).recover { case e => errorHandling(e) }
    else
      Future.successful(ScoreInfo("", 5))
  })
To capture/print the result, one way is to use onComplete:
getScores(inputs).onComplete(println)
The part you're missing is a tricky aspect of concurrency: when you use multiple futures, the order of execution is not guaranteed.
If this block is long-running, it will take a while before it appends the score to scores:
// long operation that returns Future[Option[scoreInfo]]
getgeoscore(f).foreach(gso => {
  gso.foreach(score => {
    // stick a println("here") in here to see what happens, for demonstration purposes only
    scores = scores.:+(score)
  })
})
Since that executes concurrently, your getscores function will simultaneously continue its work iterating over the rest of the inputs in your zipWithIndex. That iteration, especially since it is trivial work, will likely finish well before the long-running getgeoscore(f) Future completes, and the code will exit the function, moving on to whatever comes after the call to getscores:
val futureScores: Future[Seq[scoreInfo]] = getscores(inputs)
futureScores.onComplete {
  case Success(scoreInfoSeq) => println(s"Here's the scores: ${scoreInfoSeq.mkString(",")}")
  case Failure(e)            => println(s"Scoring failed: $e") // keep the match exhaustive
}
// at this point the call to getgeoscore(f) could still be running and finish later, but you will never know
doSomeOtherWork()
Now to clean this up: since you can run zipWithIndex on your inputs parameter, I assume it is something like inputs: Seq[Input]. If all you want to do is operate on the first input, then use head to retrieve just the first element, i.e. getgeoscore(inputs.head); you don't need the rest of the code you have there.
Also, as a note: if you're using Scala, get out of the habit of using mutable vars, especially when working with concurrency. Scala is built around supporting immutability, so if you find yourself wanting to use a var, try using a val and look up how to work with Scala's collection library to make it work.
In general, that is when you have several concurrent futures, I would say Leo's answer describes the right way to do it. However, you want only the first element transformed by a long-running operation, so you can use the Future returned by the respective function and append the other elements when the long-running call returns, by mapping the future result:
def getscores(inputs: Inputs): Future[Seq[ScoreInfo]] =
  getgeoscore(inputs.head)
    .map { optInfo =>
      optInfo.toSeq ++ inputs.tail.map(_ => ScoreInfo(id = "", score = 5))
    }
So you neither need zipWithIndex, nor an additional future, nor joining the results of several futures with sequence. Mapping the future just gives you a new future with the result transformed by the function passed to .map().
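As a tiny, self-contained illustration of that last point (the values here are made up, not from the question):
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// map transforms the eventual result without blocking and without a second future chain.
val f: Future[Int] = Future(21)
val doubled: Future[Int] = f.map(_ * 2) // completes with 42 once f completes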

How to process multiple parquet files individually in a for loop?

I have multiple parquet files (around 1000). I need to load each one of them, process it, and save the result to a Hive table. I have a for loop, but it only seems to work with 2 or 5 files, not with 1000, as it seems Spark tries to load them all at the same time, and I need it to do them individually in the same Spark session.
I tried using a for loop, then a foreach, and I used unpersist(), but it fails anyway.
val ids = get_files_IDs()

ids.foreach(id => {
  println("Starting file " + id)
  var df = load_file(id)
  var values_df = calculate_values(df)
  values_df.write.mode(SaveMode.Overwrite).saveAsTable("table.values_" + id)
  df.unpersist()
})

def get_files_IDs(): List[String] = {
  var ids = sqlContext.sql("SELECT CAST(id AS varchar(10)) FROM table.ids WHERE id IS NOT NULL")
  var ids_list = ids.select("id").map(r => r.getString(0)).collect().toList
  return ids_list
}

def calculate_values(df: org.apache.spark.sql.DataFrame): org.apache.spark.sql.DataFrame = {
  val values_id = df.groupBy($"id", $"date", $"hr_time").agg(avg($"value_a") as "avg_val_a", avg($"value_b") as "avg_value_b")
  return values_id
}

def load_file(id: String): org.apache.spark.sql.DataFrame = {
  val df = sqlContext.read.parquet("/user/hive/wh/table.db/parquet/values_for_" + id + ".parquet")
  return df
}
What I would expect is for Spark to load file ID 1, process the data, save it to the Hive table, then discard that data and continue with the second ID, and so on until it finishes the 1000 files, instead of trying to load everything at the same time.
Any help would be very appreciated! I've been stuck on this for days. I'm using Spark 1.6 with Scala. Thank you!!
EDIT: Added the definitions. Hope it helps to get a better view. Thank you!
OK, so after a lot of inspection I realised that the process was working fine. It processed each file individually and saved the results. The issue was that in some very specific cases the process was taking way too long.
So I can say that with a for loop or a foreach you can process multiple files and save the results without problems. Unpersisting and clearing the cache do help with performance.
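For reference, a minimal sketch of the loop with explicit cleanup between files, using the Spark 1.6 sqlContext from the question (whether clearCache() is actually needed depends on what gets cached along the way):
ids.foreach { id =>
  println("Starting file " + id)
  val df = load_file(id)
  val values_df = calculate_values(df)
  values_df.write.mode(SaveMode.Overwrite).saveAsTable("table.values_" + id)
  df.unpersist()          // release this file's cached blocks, if it was cached
  sqlContext.clearCache() // drop any remaining cached tables before the next file
}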

Spark collect()/count() never finishes while show() runs fast

I'm running Spark locally on my Mac and there is a weird issue. Basically, I can output any number of rows using the show() method of the DataFrame; however, when I try to use count() or collect(), even on pretty small amounts of data, Spark gets stuck on that stage and never finishes the job. I'm using Gradle for building and running.
When I run
./gradlew clean run
The program gets stuck at
> Building 83% > :run
What could cause this problem?
Here is the code.
val moviesRatingsDF = MongoSpark.load(sc).toDF().select("movieId", "userId", "rating")

val movieRatingsDF = moviesRatingsDF
  .groupBy("movieId")
  .pivot("userId")
  .max("rating")
  .na.fill(0)

val ratingColumns = movieRatingsDF.columns.drop(1) // drop the name column

val movieRatingsDS: Dataset[MovieRatingsVector] = movieRatingsDF
  .select(col("movieId").as("movie_id"), array(ratingColumns.map(x => col(x)): _*).as("ratings"))
  .as[MovieRatingsVector]

val moviePairs = movieRatingsDS.withColumnRenamed("ratings", "ratings1")
  .withColumnRenamed("movie_id", "movie_id1")
  .crossJoin(movieRatingsDS.withColumnRenamed("ratings", "ratings2").withColumnRenamed("movie_id", "movie_id2"))
  .filter(col("movie_id1") < col("movie_id2"))

val movieSimilarities = moviePairs.map(row => {
  val ratings1 = sc.parallelize(row.getAs[Seq[Double]]("ratings1"))
  val ratings2 = sc.parallelize(row.getAs[Seq[Double]]("ratings2"))
  val corr: Double = Statistics.corr(ratings1, ratings2)
  MovieSimilarity(row.getAs[Long]("movie_id1"), row.getAs[Long]("movie_id2"), corr)
}).cache()

val collectedData = movieSimilarities.collect()
println(collectedData.length)

log.warn("I'm done") // never gets here
Spark does lazy evaluation and only materializes an RDD/DataFrame when an action is called.
To answer your question:
1. With collect()/count() you are calling two different actions. If you do not persist the data, the RDD/DataFrame is re-evaluated for each action, hence it takes more time than anticipated.
2. show() involves only one action, and it only computes the first few rows (20 by default), hence it finishes.
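A minimal sketch of point 1, with a hypothetical DataFrame df rather than the question's pipeline: persist once so the second action reuses the evaluated data instead of recomputing the whole lineage.
val cached = df.cache()
val n = cached.count()       // first action: evaluates the lineage and fills the cache
val rows = cached.collect()  // second action: served from the cached data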

Spark - how to handle with lazy evaluation in case of iterative (or recursive) function calls

I have a recursive function that needs to compare the results of the current call to the previous call to figure out whether it has reached a convergence. My function does not contain any action - it only contains map, flatMap, and reduceByKey. Since Spark does not evaluate transformations (until an action is called), my next iteration does not get the proper values to compare for convergence.
Here is a skeleton of the function -
def func1(sc: SparkContext, nodes: RDD[List[Long]], didConverge: Boolean, changeCount: Int): RDD[List[Long]] = {
  if (didConverge)
    nodes
  else {
    val currChangeCount = sc.accumulator(0, "xyz")
    val newNodes = performSomeOps(nodes, currChangeCount) // does a few map/flatMap/reduceByKey operations
    if (currChangeCount.value == changeCount) {
      func1(sc, newNodes, true, currChangeCount.value)
    } else {
      func1(sc, newNodes, false, currChangeCount.value)
    }
  }
}
performSomeOps only contains map, flatMap, and reduceByKey transformations. Since it does not have any action, the code in performSomeOps does not execute, so my currChangeCount does not get the actual count. That implies the condition that checks for convergence (currChangeCount.value == changeCount) is going to be invalid. One way to overcome this is to force an action within each iteration by calling count, but that is unnecessary overhead.
I am wondering what I can do to force an action without much overhead, or whether there is another way to address this problem.
I believe there is a very important thing you're missing here:
For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.
Because of that accumulators cannot be reliably used for managing control flow and are better suited for job monitoring.
Moreover, executing an action is not unnecessary overhead. If you want to know the result of the computation, you have to perform it, unless of course the result is trivial. The cheapest action possible is:
rdd.foreach { case _ => }
but it won't address the problem you have here.
In general iterative computations in Spark can be structured as follows:
def func1(checkpointInterval: Int)(sc: SparkContext, nodes: RDD[List[Long]],
    didConverge: Boolean, changeCount: Int, iteration: Int): RDD[List[Long]] = {
  if (didConverge) nodes
  else {
    // Compute and cache new nodes (no accumulator needed for control flow here)
    val newNodes = performSomeOps(nodes).cache()
    // Periodically checkpoint to avoid stack overflow
    if (iteration % checkpointInterval == 0) newNodes.checkpoint()
    /* Call a function which computes the values that determine control flow.
       This executes an action on newNodes. */
    val currChangeCount = computeChangeCount(newNodes)
    // Unpersist old nodes
    nodes.unpersist()
    func1(checkpointInterval)(
      sc, newNodes, currChangeCount == changeCount,
      currChangeCount, iteration + 1
    )
  }
}
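The answer does not define computeChangeCount; as a purely hypothetical sketch (the real change criterion depends on performSomeOps, which isn't shown), any function that runs an action such as count() over newNodes and returns a plain value will do:
import org.apache.spark.rdd.RDD

// Hypothetical: the nonEmpty filter is only a stand-in for whatever marks a node
// as changed. The important part is that count() is an action, so calling this
// forces the cached newNodes to be evaluated before the convergence check.
def computeChangeCount(newNodes: RDD[List[Long]]): Int =
  newNodes.filter(_.nonEmpty).count().toInt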
I see that these map/flatMap/reduceByKey transformations are updating an accumulator. Therefore the only way to perform all the updates is to execute all these functions, and count is the easiest way to achieve that; it gives the lowest overhead compared to the other options (cache + count, first, or collect).
Previous answers put me on the right track to solve a similar convergence detection problem.
foreach is presented in the docs as:
foreach(func) : Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
It seems like instead of using rdd.foreach() as a cheap action to trigger accumulator increments placed in various transformations, it should be used to do the incrementing itself.
I'm unable to produce a Scala example, but here's a basic Java version, if it can still help:
// Convergence is reached when two iterations
// return the same number of results
long previousCount = -1;
long currentCount = 0;
while (previousCount != currentCount) {
    rdd = doSomethingThatUpdatesRdd(rdd);
    // Count entries in new rdd with foreach + accumulator
    rdd.foreach(tuple -> accumulator.add(1));
    // Update helper values
    previousCount = currentCount;
    currentCount = accumulator.sum();
    accumulator.reset();
}
// Convergence is reached
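Since the rest of the thread is in Scala, here is a rough Scala sketch of the same idea, assuming a Spark 2.x LongAccumulator (created via sc.longAccumulator) and an update function standing in for the actual transformations (both hypothetical names):
import org.apache.spark.rdd.RDD
import org.apache.spark.util.LongAccumulator

// Convergence is reached when two consecutive iterations
// produce the same number of results.
def iterateUntilConverged[T](initial: RDD[T],
                             update: RDD[T] => RDD[T],
                             accumulator: LongAccumulator): RDD[T] = {
  var rdd = initial
  var previousCount = -1L
  var currentCount = 0L
  while (previousCount != currentCount) {
    rdd = update(rdd)
    // foreach is the action that runs the lineage; the accumulator does the counting
    rdd.foreach(_ => accumulator.add(1))
    previousCount = currentCount
    currentCount = accumulator.sum
    accumulator.reset()
  }
  rdd
}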