Spark collect()/count() never finishes while show() runs fast - scala

I'm running Spark locally on my Mac and there is a weird issue. Basically, I can output any number of rows using show() method of the DataFrame, however, when I try to use count() or collect() even on pretty small amounts of data, the Spark is getting stuck on that stage. And never finishes its job. I'm using gradle for building and running.
When I run
./gradlew clean run
The program gets stuck at
> Building 83% > :run
What could cause this problem?
Here is the code.
val moviesRatingsDF = MongoSpark.load(sc).toDF().select("movieId", "userId","rating")
val movieRatingsDF = moviesRatingsDF
.groupBy("movieId")
.pivot("userId")
.max("rating")
.na.fill(0)
val ratingColumns = movieRatingsDF.columns.drop(1) // drop the name column
val movieRatingsDS:Dataset[MovieRatingsVector] = movieRatingsDF
.select( col("movieId").as("movie_id"), array(ratingColumns.map(x => col(x)): _*).as("ratings") )
.as[MovieRatingsVector]
val moviePairs = movieRatingsDS.withColumnRenamed("ratings", "ratings1")
.withColumnRenamed("movie_id", "movie_id1")
.crossJoin(movieRatingsDS.withColumnRenamed("ratings", "ratings2").withColumnRenamed("movie_id", "movie_id2"))
.filter(col("movie_id1") < col("movie_id2"))
val movieSimilarities = moviePairs.map(row => {
val ratings1 = sc.parallelize(row.getAs[Seq[Double]]("ratings1"))
val ratings2 = sc.parallelize(row.getAs[Seq[Double]]("ratings2"))
val corr:Double = Statistics.corr(ratings1, ratings2)
MovieSimilarity(row.getAs[Long]("movie_id1"), row.getAs[Long]("movie_id2"), corr)
}).cache()
val collectedData = movieSimilarities.collect()
println(collectedData.length)
log.warn("I'm done") //never gets here
close

Spark does lazy evaluation and creates rdd/df the when an action is called.
To answer you are question
1 .In the collect/Count you are calling two different actions, incase if you are
not persisting the data, which will cause the RDD/df to be re-evaluated, hence
forth more time than anticipated.
In the show only one action. and it shows only top 1000 rows( fingers crossed
) hence it finishes

Related

dataset transformations and actions can only be invoked by the driver

I'm getting error
Dataset transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x, see SPARK-28702
Here is my code
do {
val size = accumulatorList.value.size()
val current = accumulatorList.value.get(size -1)
joinoutput = current.join(broadvast.value, current.col("A") === broadvast.value.col("A"))
.map {x =>
val gig = x._1.getAs("Name") + "|" + x._1.getAs("PIP")
MyData(
x._2.getAs("Name"),
x._2.getAs("Place"),
x._2.getAs("Phone"),
x._2.getAs("Doc"),
gig
)}
accumulatorList.add(joinoutput)
joincount = joinoutput.count()
}
while (joincount >0 )
This error is coming while I'm joining in do block. Its running fine in my local/cluster but failing while I'm building my code with test cases. Even my test case is running fine in local, but the problem is only coming while building it.
Can anyone suggest please what's wrong happening and what can be done to fix this ?
Thanks

Killing the spark.sql

I am new to scala and spark both .
I have a code in scala which executes quieres in while loop one after the other.
What we need to do is if a particular query takes more than a certain time , for example # 10 mins we should be able to stop the query execution for that particular query and move on to the next one
for example
do {
var f = Future(
spark.sql("some query"))
)
f onSucess {
case suc - > println("Query ran in 10mins")
}
f failure {
case fail -> println("query took more than 10mins")
}
}while(some condition)
var result = Await.ready(f,Duration(10,TimeUnit.MINUTES))
I understand that when we call spark.sql the control is sent to spark which i need to kill/stop when the duration is over so that i can get back the resources
I have tried multiple things but I am not sure how to solve this.
Any help would be welcomed as i am stuck with this.

How to process multiple parquet files individually in a for loop?

I have multiple parquet files (around 1000). I need to load each one of them, process it and save the result to a Hive table. I have a for loop but it only seems to work with 2 or 5 files, but not with 1000, as it seems Sparks tries to load them all at the same time, and I need it do it individually in the same Spark session.
I tried using a for loop, then a for each, and I ussed unpersist() but It fails anyway.
val ids = get_files_IDs()
ids.foreach(id => {
println("Starting file " + id)
var df = load_file(id)
var values_df = calculate_values(df)
values_df.write.mode(SaveMode.Overwrite).saveAsTable("table.values_" + id)
df.unpersist()
})
def get_files_IDs(): List[String] = {
var ids = sqlContext.sql("SELECT CAST(id AS varchar(10)) FROM table.ids WHERE id IS NOT NULL")
var ids_list = ids.select("id").map(r => r.getString(0)).collect().toList
return ids_list
}
def calculate_values(df:org.apache.spark.sql.DataFrame): org.apache.spark.sql.DataFrame ={
val values_id = df.groupBy($"id", $"date", $"hr_time").agg(avg($"value_a") as "avg_val_a", avg($"value_b") as "avg_value_b")
return values_id
}
def load_file(id:String): org.apache.spark.sql.DataFrame = {
val df = sqlContext.read.parquet("/user/hive/wh/table.db/parquet/values_for_" + id + ".parquet")
return df
}
What I would expect is for Spark to load file ID 1, process the data, save it to the Hive table and then dismiss that date and cotinue with the second ID and so on until it finishes the 1000 files. Instead of it trying to load everything at the same time.
Any help would be very appreciated! I've been stuck on it for days. I'm using Spark 1.6 with Scala Thank you!!
EDIT: Added the definitions. Hope it can help to get a better view. Thank you!
Ok so after a lot of inspection I realised that the process was working fine. It processed each file individualy and saved the results. The issue was that in some very specific cases the process was taking way way way to long.
So I can tell that with a for loop or for each you can process multiple files and save the results without problem. Unpersisting and clearing cache do helps on performance.

SparkSQL performance issue with collect method

We are currently facing a performance issue in sparksql written in scala language. Application flow is mentioned below.
Spark application reads a text file from input hdfs directory
Creates a data frame on top of the file using programmatically specifying schema. This dataframe will be an exact replication of the input file kept in memory. Will have around 18 columns in the dataframe
var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema)
Creates a filtered dataframe from the first data frame constructed in step 2. This dataframe will contain unique account numbers with the help of distinct keyword.
var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()
Using the two dataframes constructed in step 2 & 3, we will get all the records which belong to one account number and do some Json parsing logic on top of the filtered data.
var filtrEqpDF =
eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
Finally the json parsed data will be put into Hbase table
Here we are facing performance issues while calling the collect method on top of the data frames. Because collect will fetch all the data into a single node and then do the processing, thus losing the parallel processing benefit.
Also in real scenario there will be 10 billion records of data which we can expect. Hence collecting all those records in to driver node will might crash the program itself due to memory or disk space limitations.
I don't think the take method can be used in our case which will fetch limited number of records at a time. We have to get all the unique account numbers from the whole data and hence I am not sure whether take method, which takes
limited records at a time, will suit our requirements
Appreciate any help to avoid calling collect methods and have some other best practises to follow. Code snippets/suggestions/git links will be very helpful if anyone have had faced similar issues
Code snippet
val eqpSchemaString = "acoountnumber ....."
val eqpSchema = StructType(eqpSchemaString.split(" ").map(fieldName =>
StructField(fieldName, StringType, true)));
val eqpRdd = sc.textFile(inputPath)
val eqpRowRdd = eqpRdd.map(_.split(",")).map(eqpRow => Row(eqpRow(0).trim, eqpRow(1).trim, ....)
var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema);
var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()
distAccNrsDF.foreach { data =>
var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
var result = new JSONObject()
result.put("jsonSchemaVersion", "1.0")
val firstRowAcc = filtrEqpDF(0)
//Json parsing logic
{
.....
.....
}
}
The approach usually take in this kind of situation is:
Instead of collect, invoke foreachPartition: foreachPartition applies a function to each partition (represented by an Iterator[Row]) of the underlying DataFrame separately (the partition being the atomic unit of parallelism of Spark)
the function will open a connection to HBase (thus making it one per partition) and send all the contained values through this connection
This means the every executor opens a connection (which is not serializable but lives within the boundaries of the function, thus not needing to be sent across the network) and independently sends its contents to HBase, without any need to collect all data on the driver (or any one node, for that matter).
It looks like you are reading a CSV file, so probably something like the following will do the trick:
spark.read.csv(inputPath). // Using DataFrameReader but your way works too
foreachPartition { rows =>
val conn = ??? // Create HBase connection
for (row <- rows) { // Loop over the iterator
val data = parseJson(row) // Your parsing logic
??? // Use 'conn' to save 'data'
}
}
You can ignore collect in your code if you have large set of data.
Collect Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
Also this can cause the driver to run out of memory, though, because collect() fetches the entire RDD/DF to a single machine.
I have just edited your code, which should work for you.
var distAccNrsDF = eqpDF.select("accountnumber").distinct()
distAccNrsDF.foreach { data =>
var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'")
var result = new JSONObject()
result.put("jsonSchemaVersion", "1.0")
val firstRowAcc = filtrEqpDF(0)
//Json parsing logic
{
.....
.....
}
}

Method for reducing memory load of Spark program

I have a Spark program with calculates relations between users, i.e. it receives data set of type:
RDD[(java.lang.Long, Map[(String, String), Integer])]
Where the Long is timestamp, and the map is a score relevant to tuples of two users. and should run some function over the scores and return the following type:
Map[String, Map[java.lang.Long, java.lang.Double]]
Where the String is the first String in the tuple, and the map is the results of the function per timeslot.
In my case I have around 2000 users so the maps I receive are quite big (2000^2 per timeslot), and also the results relies on the previous timeslot results.
I am running the program locally and receiving GC overhead limit exceeded. I increased the heap memory to 14g using: -Xmx14G in vmarguments (I see the java process is occupying more than 12g of memory) but it didn't help.
Currently implemented method
I have tried several directions to decrease the memory consumption and currently came up with the following idea: since every timestamp relies only on the previous one I will collect every timeslot separately and keep the previous results on driver. In this manner I will run calculations only on part of the data and hopefully it will not crush the program.
The code:
def calculateScorePerTimeslot(scorePerTimeslotRDD: RDD[(java.lang.Long, Map[(String, String), Integer])]): Map[String, Map[java.lang.Long, java.lang.Double]] = {
var distancesPerTimeslotVarRDD = distancesPerTimeslotRDD.groupBy(_._1).sortBy(_._1)
println("Start collecting all the results - cache the data!!")
distancesPerTimeslotVarRDD.cache()
println("Caching all the data has completed!")
while(!distancesPerTimeslotVarRDD.isEmpty())
{
val dataForTimeslot: (java.lang.Long, Iterable[(java.lang.Long, Map[(String, String), Integer])]) = distancesPerTimeslotVarRDD.first()
println("Retrieved data for timeslot: " + dataForTimeslot._1)
//Code which is irrelevant for question - logic
println("Removing timeslot: " + dataForTimeslot._1)
distancesPerTimeslotVarRDD = distancesPerTimeslotVarRDD.filter(t => !t._1.equals(dataForTimeslot._1))
println("Filtering has complete! - without: " + dataForTimeslot._1)
}
}
Summary: Basically, the idea is to extract one timeslot at a time process it and save the results at driver - in this manner I try to reduce the size of data which passes on collect.
Reason I write this post
Unfortunately, this doesn't help me and the program still dies. My question is: is this manner of taking the first() item of a RDD and then filter it have the effect of iterating over the items on RDD? Are there other better ideas to tackle this kinds of question (better ideas which are not increasing the memory or moving to a real distributed cluster)?
Firstly, RDD[(java.lang.Long, Map[(String, String), Integer])] uses more memory than RDD[(java.lang.Long, Array[(String, String, Integer)])]. You'll save some memory if you can use the latter.
Secondly, your loop is pretty inefficient in caching data. Always call unpersist on any RDD you no longer need.
distancesPerTimeslotVarRDD.cache()
var rddSize = distancesPerTimeslotVarRDD.count()
println("Caching all the data has completed!")
while(rddSize > 0) {
val prevRDD = distancesPerTimeslotVarRDD
val dataForTimeslot = distancesPerTimeslotVarRDD.first()
println("Retrieved data for timeslot: " + dataForTimeslot._1)
// Code which is irrelevant for answer - logic
println("Removing timeslot: " + dataForTimeslot._1)
// Cache the new value of distancesPerTimeslotVarRDD
distancesPerTimeslotVarRDD = distancesPerTimeslotVarRDD.filter(t => !t._1.equals(dataForTimeslot._1)).cache()
// Force calculation so we can throw away previous iteration value
rddSize = distancesPerTimeslotVarRDD.count()
println("Filtering has complete! - without: " + dataForTimeslot._1)
// Get rid of previously cached RDD
prevRDD.unpersist(false)
}
Thirdly, you can try using Kryo Serializer, though this sometimes makes things worse. You have to configure the serializer and replace cache with persist(StorageLevel.MEMORY_ONLY_SER)