Dataset transformations and actions can only be invoked by the driver - Scala

I'm getting this error:
Dataset transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x); see SPARK-28702
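For reference, the pattern the message describes looks roughly like this (a minimal sketch built from the rdd1/rdd2 placeholders in the error text, not my actual code):

// invalid: an action (count) on another dataset is invoked inside a transformation,
// i.e. in code that runs on the executors rather than on the driver
val scaled = rdd1.map(x => rdd2.values.count() * x)

// driver-side rewrite: materialise the count first, then use the plain value
val n = rdd2.values.count()
val scaledOk = rdd1.map(x => n * x)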
Here is my code:
do {
  val size = accumulatorList.value.size()
  val current = accumulatorList.value.get(size - 1)

  joinoutput = current.join(broadvast.value, current.col("A") === broadvast.value.col("A"))
    .map { x =>
      val gig = x._1.getAs("Name") + "|" + x._1.getAs("PIP")
      MyData(
        x._2.getAs("Name"),
        x._2.getAs("Place"),
        x._2.getAs("Phone"),
        x._2.getAs("Doc"),
        gig
      )
    }

  accumulatorList.add(joinoutput)
  joincount = joinoutput.count()
} while (joincount > 0)
This error is thrown by the join inside the do block. The code runs fine on my local machine and on the cluster, and even my test case passes locally; the failure only appears while the code is being built with the test cases.
Can anyone suggest what is going wrong and what can be done to fix it?
Thanks

Related

Killing the spark.sql

I am new to both Scala and Spark.
I have Scala code which executes queries in a while loop, one after the other.
What we need is: if a particular query takes more than a certain time, for example 10 minutes, we should be able to stop the query execution for that particular query and move on to the next one.
For example:
import java.util.concurrent.TimeUnit
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

do {
  val f = Future {
    spark.sql("some query")
  }
  f.onSuccess {
    case suc => println("Query ran in 10 mins")
  }
  f.onFailure {
    case fail => println("Query took more than 10 mins")
  }
  val result = Await.ready(f, Duration(10, TimeUnit.MINUTES))
} while (someCondition) // placeholder for the real loop condition
I understand that when we call spark.sql the control is handed over to Spark, and that is what I need to kill/stop once the duration is over so that I can get the resources back.
I have tried multiple things but I am not sure how to solve this.
Any help would be welcome, as I am stuck on this.
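A rough sketch of the kind of pattern I am after, in case it clarifies the question (assuming spark is the active SparkSession; whether cancelJobGroup really frees everything is exactly the part I am unsure about):

import java.util.concurrent.{TimeUnit, TimeoutException}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

val groupId = "timed-query"

val f = Future {
  // setJobGroup is per-thread, so tag the jobs on the thread that actually runs the query
  spark.sparkContext.setJobGroup(groupId, "query with a 10 minute budget", interruptOnCancel = true)
  spark.sql("some query").collect()
}

try {
  Await.result(f, Duration(10, TimeUnit.MINUTES))
  println("Query ran in 10 mins")
} catch {
  case _: TimeoutException =>
    // ask Spark to cancel the still-running jobs tagged with this group id
    spark.sparkContext.cancelJobGroup(groupId)
    println("Query took more than 10 mins")
}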

How to process multiple parquet files individually in a for loop?

I have multiple parquet files (around 1000). I need to load each one, process it, and save the result to a Hive table. I have a for loop, but it only seems to work with 2 or 5 files, not with 1000: it looks as if Spark tries to load them all at the same time, and I need it to process them individually, in the same Spark session.
I tried using a for loop, then a foreach, and I used unpersist(), but it fails anyway.
val ids = get_files_IDs()

ids.foreach { id =>
  println("Starting file " + id)
  val df = load_file(id)
  val values_df = calculate_values(df)
  values_df.write.mode(SaveMode.Overwrite).saveAsTable("table.values_" + id)
  df.unpersist()
}

def get_files_IDs(): List[String] = {
  val ids = sqlContext.sql("SELECT CAST(id AS varchar(10)) FROM table.ids WHERE id IS NOT NULL")
  ids.select("id").map(r => r.getString(0)).collect().toList
}

def calculate_values(df: org.apache.spark.sql.DataFrame): org.apache.spark.sql.DataFrame = {
  df.groupBy($"id", $"date", $"hr_time")
    .agg(avg($"value_a") as "avg_val_a", avg($"value_b") as "avg_value_b")
}

def load_file(id: String): org.apache.spark.sql.DataFrame = {
  sqlContext.read.parquet("/user/hive/wh/table.db/parquet/values_for_" + id + ".parquet")
}
What I would expect is for Spark to load file ID 1, process the data, save it to the Hive table, then discard that data and continue with the second ID, and so on until it finishes the 1000 files, instead of trying to load everything at the same time.
Any help would be very much appreciated! I've been stuck on this for days. I'm using Spark 1.6 with Scala. Thank you!!
EDIT: Added the function definitions. I hope they give a better view. Thank you!
OK, so after a lot of inspection I realised that the process was actually working fine. It processed each file individually and saved the results. The issue was that in some very specific cases the process was taking far too long.
So I can confirm that with a for loop or a foreach you can process multiple files and save the results without a problem. Unpersisting and clearing the cache does help performance.
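For anyone who lands here, this is roughly where the cleanup calls sit in the loop (a sketch only, reusing the helper names from the question; note that sqlContext.clearCache() drops every cached table in the session, so only use it if nothing else depends on the cache):

ids.foreach { id =>
  println("Starting file " + id)
  val df = load_file(id)
  val values_df = calculate_values(df)
  values_df.write.mode(SaveMode.Overwrite).saveAsTable("table.values_" + id)

  // release anything pinned in memory before moving on to the next file
  df.unpersist()
  sqlContext.clearCache()
}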

Spark collect()/count() never finishes while show() runs fast

I'm running Spark locally on my Mac and there is a weird issue. Basically, I can output any number of rows using the DataFrame's show() method; however, when I try to use count() or collect(), even on pretty small amounts of data, Spark gets stuck on that stage and never finishes the job. I'm using Gradle for building and running.
When I run
./gradlew clean run
The program gets stuck at
> Building 83% > :run
What could cause this problem?
Here is the code.
val moviesRatingsDF = MongoSpark.load(sc).toDF().select("movieId", "userId", "rating")

val movieRatingsDF = moviesRatingsDF
  .groupBy("movieId")
  .pivot("userId")
  .max("rating")
  .na.fill(0)

val ratingColumns = movieRatingsDF.columns.drop(1) // drop the name column

val movieRatingsDS: Dataset[MovieRatingsVector] = movieRatingsDF
  .select(col("movieId").as("movie_id"), array(ratingColumns.map(x => col(x)): _*).as("ratings"))
  .as[MovieRatingsVector]

val moviePairs = movieRatingsDS
  .withColumnRenamed("ratings", "ratings1")
  .withColumnRenamed("movie_id", "movie_id1")
  .crossJoin(movieRatingsDS.withColumnRenamed("ratings", "ratings2").withColumnRenamed("movie_id", "movie_id2"))
  .filter(col("movie_id1") < col("movie_id2"))

val movieSimilarities = moviePairs.map(row => {
  val ratings1 = sc.parallelize(row.getAs[Seq[Double]]("ratings1"))
  val ratings2 = sc.parallelize(row.getAs[Seq[Double]]("ratings2"))
  val corr: Double = Statistics.corr(ratings1, ratings2)
  MovieSimilarity(row.getAs[Long]("movie_id1"), row.getAs[Long]("movie_id2"), corr)
}).cache()

val collectedData = movieSimilarities.collect()
println(collectedData.length)
log.warn("I'm done") // never gets here
Spark does lazy evaluation: an RDD/DataFrame is only materialised when an action is called.
To answer your question:
1. With collect/count you are calling two different actions. If you are not persisting the data, the RDD/DataFrame gets re-evaluated for each action, which takes more time than anticipated.
2. show() is a single action, and it only has to produce the top handful of rows (20 by default), hence it finishes quickly.
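To make the persistence point concrete, a minimal sketch (df here is just a stand-in for any DataFrame you plan to hit with more than one action):

import org.apache.spark.storage.StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK) // evaluated once, on the first action below

val n = df.count()      // first action: triggers the full evaluation
val rows = df.collect() // second action: served from the persisted data

println(n)
df.unpersist()          // release the cached blocks when finished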

Subgraphing in a foreach loop in Spark and GraphX problems

Hoping somebody can help.
I'm trying to write a program which needs to run a function on every edge ID connected to each node in a GraphX network.
To do this I want to iterate over each node, identify all the edges connected to it, and then iterate over each of those edges with a function. My problem seems to arise when doing any kind of subgraphing or filtering within a foreach loop.
For example, the code below should output the ID of each edge connected to a node:
graph.vertices.foreach { network =>
  val KeyVert = network._1
  val EGraph = graph.subgraph(e => e.dstId == KeyVert)
  println(KeyVert)
  EGraph.edges.foreach(println)
}
However, it only works if you add the collect function to pull the graph data out of the RDD, e.g.
graph.vertices.collect.foreach { network =>
  val KeyVert = network._1
  val EGraph = graph.subgraph(e => e.dstId == KeyVert)
  println(KeyVert)
  EGraph.edges.foreach(println)
}
The network is too large to be collecting edge data so any help would be much appreciated.
The problem is the difference between the driver and the workers. When you call collect, all the data is pulled back to the driver, and only then does the foreach appear to work. In fact, graph.vertices.foreach did not report any error, right? That is because it really does run; it just prints the output to the workers' logs instead of the driver console. Do you see what I mean? Hope it helps.
graph.vertices.map { network =>
  val KeyVert = network._1
  val EGraph = graph.subgraph(e => e.dstId == KeyVert)
  println(KeyVert)
  EGraph.edges.map(println)
}
That may solve your problem.
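If the end goal is just the edge IDs grouped per destination vertex, another option (not from the answer above, just a sketch) is to key and group the edge RDD directly, so graph is never referenced inside another RDD operation:

// key every edge by its destination vertex, then group so each node's edges
// can be handled together inside a single transformation
val edgesByNode = graph.edges
  .map(e => (e.dstId, e))
  .groupByKey()

edgesByNode.foreach { case (keyVert, edges) =>
  // runs on the executors, so the output lands in the worker logs, not the driver console
  println(keyVert)
  edges.foreach(println)
}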

Reading a SQL file using getResources in Scala

I'm trying to read and execute a SQL file in Spark SQL.
sqlContext.sql(scala.io.Source.fromInputStream(getClass.getResourceAsStream("/" + "dq.sql")).getLines.mkString(" ").stripMargin).take(1)
My SQL is very long. When I run it directly in the spark-shell, it runs fine. When I try to read it using getResourceAsStream, I'm hitting
java.lang.RuntimeException: [1.10930] failure: end of input
A simple solution could be to read the SQL at the driver (using any file utility) and pass it in as a variable, like ssc.sql(sqlvar):
val stream: InputStream = getClass.getResourceAsStream("/filename.txt")
val readFile = scala.io.Source.fromInputStream(stream).getLines
val spa = readFile.map(line => " " + line)
val spl = spa.mkString.split(";")

for (m1 <- spl) {
  sqlContext.sql(m1)
}
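One small addition that may help: splitting on ";" typically leaves an empty fragment after the final semicolon, so it can be worth trimming and filtering the statements before executing them (a sketch building on the snippet above):

val statements = spa.mkString
  .split(";")
  .map(_.trim)
  .filter(_.nonEmpty) // drop blanks and the empty fragment after the last ";"

statements.foreach(stmt => sqlContext.sql(stmt))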