how to update an RDD - scala

I have an RDD[(Int, Array[Double], Double, Double)].
val full_data = rdd.map(row => {
val label = row._1
val feature = row._2.map(_.toDouble)
val QD = k_function(feature)
val alpha = 0.0
(label,feature,QD,alpha)
})
Now I want to update the value of alpha in each record (say, to 10):
var tmp = full_data.map( x=> {
x._4 = 10
})
I got the error
Error: reassignment to val
x._4 = 10
I have changed all the val to var, but the error still occurs. How can I update the value of alpha? I would also like to know how to update the full row or a specific row in an RDD.

RDDs are immutable by nature. They are made so for easy caching, sharing and replication. It is always safer to copy than to mutate in a multi-threaded system like Spark, for fault tolerance and correctness in processing. Recreating immutable data is also much easier than recreating mutable data.
A transformation is like copying the RDD data to another RDD. Every element is treated as a val, i.e. it is immutable, so if you are looking to replace the last double with 10, what you can do is:
var tmp = full_data.map( x=> {
(x._1, x._2, x._3, 10.0) // 10.0 keeps the fourth element a Double
})
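If you only want to change alpha for specific records, the copy can be made conditional inside the same map. A minimal sketch, assuming you match on the Int label; the label value 3 and the new alpha 10.0 are just placeholders:
// Copy every tuple unchanged, except where the label matches.
val updated = full_data.map {
  case (label, feature, qd, _) if label == 3 => (label, feature, qd, 10.0)
  case other => other
}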

Related

spark scala: Performance degrade with simple UDF over large number of columns

I have a dataframe with 100 million rows and ~ 10,000 columns. The columns are of two types, standard (C_i) followed by dynamic (X_i). This dataframe was obtained after some processing, and the performance was fast. Now only 2 steps remain:
Goal:
A particular operation needs to be done on every X_i using an identical subset of C_i columns.
Convert each X_i column into FloatType.
Difficulty:
Performance degrades terribly with increasing number of columns.
After a while, only 1 executor seems to work (%CPU use < 200%), even on sample data with 100 rows and 1,000 columns. If I push it to 1,500 columns, it crashes.
Minimal code:
import spark.implicits._
import org.apache.spark.sql.types.FloatType
import org.apache.spark.sql.functions.{col, udf}
// sample_udf
val foo = (s_val: String, t_val: String) => {
t_val + s_val.takeRight(1)
}
val foos_udf = udf(foo)
spark.udf.register("foos_udf", foo)
val columns = Seq("C1", "C2", "X1", "X2", "X3", "X4")
val data = Seq(("abc", "212", "1", "2", "3", "4"),("def", "436", "2", "2", "1", "8"),("abc", "510", "1", "2", "5", "8"))
val rdd = spark.sparkContext.parallelize(data)
var df = spark.createDataFrame(rdd).toDF(columns:_*)
df.show()
for (cols <- df.columns.drop(2)) {
df = df.withColumn(cols, foos_udf(col("C2"),col(cols)))
}
df.show()
for (cols <- df.columns.drop(2)) {
df = df.withColumn(cols,col(cols).cast(FloatType))
}
df.show()
Error on 1,500 column data:
Exception in thread "main" java.lang.StackOverflowError
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.isStreaming(LogicalPlan.scala:37)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$isStreaming$1.apply(LogicalPlan.scala:37)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$isStreaming$1.apply(LogicalPlan.scala:37)
at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
at scala.collection.immutable.List.exists(List.scala:84)
...
Thoughts:
Perhaps var could be replaced, but the size of the data is close to 40% of the RAM.
Perhaps the for loop used for dtype casting is causing the performance degradation, though I can't see how, or what the alternatives are. From searching on the internet, I have seen people suggest a foldLeft-based approach, but that apparently still gets translated to a for loop internally.
Any inputs on this would be greatly appreciated.
A faster solution was to call the UDF on the row itself rather than on each column. Since Spark stores data as rows, the earlier approach exhibited terrible performance.
def my_udf(names: Array[String]) = udf[String,Row]((r: Row) => {
val row = Array.ofDim[String](names.length)
for (i <- 0 until row.length) {
row(i) = r.getAs(i)
}
...
})
...
val df2 = df1.withColumn(results_col,my_udf(df1.columns)(struct("*"))).select(col(results_col))
Type casting can be done as suggested by Riccardo.
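For reference, a minimal self-contained sketch of the same row-based idea; the udf[String, Row] signature mirrors the snippet above (Spark 2.x style), and the joined-string output and the results column name are only placeholders for the real per-row computation:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}
// One UDF invocation per row; the body just concatenates the string values
// of all columns as a stand-in for the real computation.
def rowUdf(names: Array[String]) = udf[String, Row]((r: Row) => {
  val values = Array.ofDim[String](names.length)
  for (i <- values.indices) values(i) = r.getAs[String](i)
  values.mkString(",")
})
val resultsCol = "results" // hypothetical output column name
val df2 = df.withColumn(resultsCol, rowUdf(df.columns)(struct("*"))).select(col(resultsCol))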
Not sure if this will fix the performance on your side with ~10,000 columns, but I was able to run it locally with 1,500 using the following code.
I addressed points #1 and #2, which may have had some impact on performance. One note: to my understanding, foldLeft is a plain recursive function without an internal for loop, so it might have an impact on performance in this case.
Also, the two for loops can be combined into a single pass, which I refactored as a foldLeft.
We might also get a performance increase if we replaced the udf with a built-in Spark function (see the sketch after the code below).
import spark.implicits._
import org.apache.spark.sql.types.{FloatType, StringType, StructField, StructType}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
// sample_udf
val foo = (s_val: String, t_val: String) => {
t_val + s_val.takeRight(1)
}
val foos_udf = udf(foo)
spark.udf.register("foos_udf", foo)
val numberOfColumns = 1500
val numberOfRows = 100
val colNames = (1 to numberOfColumns).map(s => s"X$s")
val colValues = (1 to numberOfColumns).map(_.toString)
val columns = Seq("C1", "C2") ++ colNames
val schema = StructType(columns.map(field => StructField(field, StringType)))
val rowFields = Seq("abc", "212") ++ colValues
val listOfRows = (1 to numberOfRows).map(_ => Row(rowFields: _*))
val listOfRdds = spark.sparkContext.parallelize(listOfRows)
val df = spark.createDataFrame(listOfRdds, schema)
df.show()
val newDf = df.columns.drop(2).foldLeft(df)((df, colName) => {
df.withColumn(colName, foos_udf(col("C2"), col(colName)) cast FloatType)
})
newDf.show()
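On the last point, a sketch of what replacing the udf with built-in functions could look like here (not benchmarked): the UDF just appends the last character of C2 to the column value, which concat plus substring with a negative offset expresses directly.
// Same transformation without a UDF: append the last character of C2, then cast.
val newDfNoUdf = df.columns.drop(2).foldLeft(df)((acc, colName) =>
  acc.withColumn(colName, concat(col(colName), substring(col("C2"), -1, 1)).cast(FloatType))
)
newDfNoUdf.show()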
Hope this helps!
*** EDIT
Found a much better solution that circumvents the loops. Simply build a single expression with selectExpr; this way Spark casts all the columns in one go without any kind of recursion. From my previous example:
Instead of doing the foldLeft, just replace it with these lines. I tested it with 10k columns and 100 rows on my local computer, and it took a few seconds:
val selectExpression = Seq("C1", "C2") ++ colNames.map(s => s"cast($s as float)")
val newDf = df.selectExpr(selectExpression:_*)
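Since foos_udf was registered with spark.udf.register, the UDF call and the cast can in principle be folded into the same single pass as well; a sketch, untested at the 10,000-column scale:
// Apply the registered UDF and the cast in one selectExpr, keeping C1 and C2 as-is.
val combinedExpression = Seq("C1", "C2") ++ colNames.map(s => s"cast(foos_udf(C2, $s) as float) as $s")
val newDf2 = df.selectExpr(combinedExpression: _*)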

How to filter an rdd by data type?

I have an RDD that I am trying to filter for only float type. Do Spark RDDs provide any way of doing this?
I have a CSV where I need only float values greater than 40 in a new RDD. To achieve this, I am checking whether the value is an instance of type Float and filtering. When I filter with a !, all the strings are still there in the output, and when I don't use !, the output is empty.
val airports1 = airports.filter(line => !line.split(",")(6).isInstanceOf[Float])
val airports2 = airports1.filter(line => line.split(",")(6).toFloat > 40)
At the .toFloat, I run into a NumberFormatException, which I've tried to handle in a try-catch block.
Since you have a plain string and you are trying to get float values from it, you are not actually filtering by type, but by whether the value can be parsed to a float.
You can accomplish that using a flatMap together with Option.
import org.apache.spark.sql.SparkSession
import scala.util.Try
val spark = SparkSession.builder.master("local[*]").appName("Float caster").getOrCreate()
val sc = spark.sparkContext
val data = List("x,10", "y,3.3", "z,a")
val rdd = sc.parallelize(data) // rdd: RDD[String]
val filtered = rdd.flatMap(line => Try(line.split(",")(1).toFloat).toOption) // filtered: RDD[Float]
filtered.collect() // res0: Array[Float] = Array(10.0, 3.3)
For the > 40 part, you can either perform another filter afterwards or filter the inner Option.
(Both should perform more or less the same due to Spark's laziness, so choose whichever is clearer to you.)
// Option 1 - Another filter.
val filtered2 = filtered.filter(x => x > 40)
// Option 2 - Filter the inner option in one step.
val filtered = rdd.flatMap(line => Try(line.split(",")(1).toFloat).toOption.filter(x => x > 40))
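If you need to keep the whole CSV line (not just the parsed float) in the new RDD, the same Try/Option pattern can carry the line through. A sketch, assuming airports is your RDD[String] of CSV lines and the float sits at index 6 as in your snippet:
// Keep the full line together with its parsed value, dropping unparseable rows.
val airportsOver40 = airports.flatMap { line =>
  Try(line.split(",")(6).toFloat).toOption.filter(_ > 40).map(f => (line, f))
}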
Let me know if you have any question.

how to design the Spark application so that Shuffle data will be automatically cleaned up after some iterations

In the Spark core "example" directory (I am using Spark 1.2.0), there is an example called "SparkPageRank.scala":
val sparkConf = new SparkConf().setAppName("PageRank")
val iters = if (args.length > 0) args(1).toInt else 10
val ctx = new SparkContext(sparkConf)
val lines = ctx.textFile(args(0), 1)
val links = lines.map{ s =>
val parts = s.split("\\s+")
(parts(0), parts(1))
}.distinct().groupByKey().cache()
var ranks = links.mapValues(v => 1.0)
for (i <- 1 to iters) {
val contribs = links.join(ranks).values.flatMap{ case (urls, rank) =>
val size = urls.size
urls.map(url => (url, rank / size))
}
ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
val output = ranks.collect()
ctx.stop()
I realize that in this example, the lineage keeps extending after each iteration. As a result, when I monitored the directory that holds the shuffle data, the shuffle data storage kept increasing after each iteration.
How should I structure the application code so that the ContextCleaner's doCleanupShuffle is activated after a certain interval (say, every few iterations), to prevent the ever-increasing shuffle data storage in a computation that takes many iterations?
Jun
Apparently the clean-up of the shuffle files happens when the objects used for the shuffle are GCed. Since your snippet is a simple PageRank example, I assume you are running it on a very small dataset, so your memory usage never approaches the size of the heap and the objects are never GCed. Try with a bigger file, or trigger the GC manually (although it is usually not recommended).
More information on : https://github.com/apache/spark/pull/5074/files
By the way, in your example it would be more efficient to partition the data instead of shuffling it every time (see the sketch below).
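To illustrate: hash-partitioning links once and reusing that partitioner for ranks avoids re-shuffling links on every join, and periodically checkpointing ranks truncates the lineage so old shuffle data becomes eligible for cleanup. A rough sketch of the same loop along those lines; the partition count, checkpoint interval and checkpoint directory are arbitrary placeholders:
import org.apache.spark.HashPartitioner

ctx.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical checkpoint location

// Partition links once and keep that partitioner, so the join does not
// re-shuffle the links on every iteration.
val links = lines.map { s =>
  val parts = s.split("\\s+")
  (parts(0), parts(1))
}.distinct().groupByKey(new HashPartitioner(8)).cache()

var ranks = links.mapValues(v => 1.0)
for (i <- 1 to iters) {
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    urls.map(url => (url, rank / urls.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
  // Every few iterations, cut the lineage so old shuffle data can be cleaned up.
  if (i % 5 == 0) {
    ranks.checkpoint()
    ranks.count() // an action is needed to actually materialize the checkpoint
  }
}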

My RDD changes its values by itself

I have a basic RDD[Object] to which I apply a map with a hash function on the Object values, using the Scala nextGaussian and nextDouble functions. When I print the values, they change at each print.
def hashmin(x:Data_Object, w:Double) = {
val x1 = x.get_vector.toArray
var a1 = Array(0.0).tail
val b = Random.nextDouble * w
for( ind <- 0 to x1.size-1) {
val nG = Random.nextGaussian
a1 = a1 :+ nG
}
var sum = 0.0
for( ind <- 0 to x1.size-1) {
sum = sum + (x1(ind)*a1(ind))
}
val hash_val = (sum+b)/w
val hash_val1 = (x.get_id,hash_val)
hash_val1
}
val w = 8
val rddhash = parsedData.map(x => hashmin(x,w))
rddhash.foreach(println)
rddhash.foreach(println)
I don't understand why. Thank you in advance.
RDDs are merely a "pointer" to the data + operations to be applied to it. Actions materialize those operations by executing the RDD lineage.
So, RDDs are basically recomputed when an action is requested. In this case, the map function calling hashmin is being evaluated every time the foreach action is called.
There are a few options:
Cache the RDD - this will cause the lineage to be broken and the results of the first transformation will be preserved:
val rddhash = parsedData.map(x => hashmin(x,w)).cache()
Use a seed for your random function, so that the generated pseudo-random sequence is the same each time (a sketch of this follows).
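A minimal sketch of the seeding idea, reusing the get_id and get_vector accessors from your snippet; seeding from get_id's hashCode is just one possible choice and assumes get_id is stable for a given record:
import scala.util.Random

// Deterministic variant: the generator is seeded per record, so recomputing
// the RDD reproduces exactly the same pseudo-random draws.
def hashminSeeded(x: Data_Object, w: Double) = {
  val rng = new Random(x.get_id.hashCode)
  val x1 = x.get_vector.toArray
  val b = rng.nextDouble * w
  val a1 = Array.fill(x1.length)(rng.nextGaussian)
  val sum = x1.zip(a1).map { case (xi, ai) => xi * ai }.sum
  (x.get_id, (sum + b) / w)
}

val rddhashSeeded = parsedData.map(x => hashminSeeded(x, w))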
RDDs are lazy - they're computed when they're used. So the calls to Random.nextGaussian are made again each time you call foreach.
You can use persist() to store an RDD if you want to keep fixed values.

Efficient countByValue of each column Spark Streaming

I want to find countByValue for each column in my data. I can find countByValue() for each column (e.g. 2 columns for now) on a basic batch RDD as follows:
scala> val double = sc.textFile("double.csv")
scala> val counts = sc.parallelize((0 to 1).map(index => {
double.map(x=> { val token = x.split(",")
(math.round(token(index).toDouble))
}).countByValue()
}))
scala> counts.take(2)
res20: Array[scala.collection.Map[Long,Long]] = Array(Map(2 -> 5, 1 -> 5), Map(4 -> 5, 5 -> 5))
Now I want to do the same with DStreams. I have a windowedDStream and want to countByValue on each column. My data has 50 columns. I have done it as follows:
val windowedDStream = myDStream.window(Seconds(2), Seconds(2)).cache()
ssc.sparkContext.parallelize((0 to 49).map(index=> {
val counts = windowedDStream.map(x=> { val token = x.split(",")
(math.round(token(index).toDouble))
}).countByValue()
counts.print()
}))
val topCounts = counts.map . . . . will not work
I get correct results with this; the only issue is that I want to apply more operations on counts, and it is not available outside the map.
You misunderstand what parallelize does. You think that when you give it a Seq of two elements, those two elements will be calculated in parallel. That is not the case, and it would be impossible for it to be the case.
What parallelize actually does is create an RDD from the Seq that you provided.
To try to illuminate this, consider that this:
val countsRDD = sc.parallelize((0 to 1).map { index =>
double.map { x =>
val token = x.split(",")
math.round(token(index).toDouble)
}.countByValue()
})
is equivalent to this:
val counts = (0 to 1).map { index =>
double.map { x =>
val token = x.split(",")
math.round(token(index).toDouble)
}.countByValue()
}
val countsRDD = sc.parallelize(counts)
By the time parallelize runs, the work has already been performed. parallelize cannot retroactively make it so that the calculation happened in parallel.
The solution to your problem is to not use parallelize here; it is entirely pointless. One alternative is sketched below.
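If the goal is to keep transforming the counts, one option is to stay inside DStream operations: emit ((columnIndex, roundedValue), 1) pairs for every column of every line and reduce by key, which yields a single DStream you can keep operating on. A rough sketch, assuming 50 comma-separated numeric columns as in your snippet:
// One keyed DStream of ((columnIndex, value), count) instead of a
// driver-side loop over the columns.
val columnValueCounts = windowedDStream
  .flatMap { line =>
    val tokens = line.split(",")
    (0 to 49).map(i => ((i, math.round(tokens(i).toDouble)), 1L))
  }
  .reduceByKey(_ + _)

// Further operations remain available, e.g. the top counts per batch:
columnValueCounts.foreachRDD { rdd =>
  val byCount = Ordering.by[((Int, Long), Long), Long](_._2)
  rdd.top(10)(byCount).foreach(println)
}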