Weighted average with Spark Datasets without UDF - scala

While someone has already asked about computing a Weighted Average in Spark, in this question, I'm asking about using Datasets/DataFrames instead of RDDs.
How do I compute a weighted average in Spark? I have two columns: counts and previous averages:
case class Stat(name:String, count: Int, average: Double)
val statset = spark.createDataset(Seq(Stat("NY", 1,5.0),
I would like to be able to compute a weighted average like this:
One can use a UDF to get close:
val weightedAverage = udf(
  (row: Row) => {
    val counts = row.getAs[WrappedArray[Int]](0)
    val averages = row.getAs[WrappedArray[Double]](1)
    val (count, total) = (counts zip averages).foldLeft((0, 0.0)) {
      case ((c, t), (n, avg)) => (c + n, t + n * avg) // running count and running weighted sum
    }
    total / count // Tested by returning count here and then extracting. Got same result as sum.
  }
)
(Thanks to answers to Passing a list of tuples as a parameter to a spark udf in scala for help in writing this)
Newbies: Use these imports:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import scala.collection.mutable.WrappedArray
Is there a way of accomplishing this with built-in column functions instead of UDFs? The UDF feels clunky, and if the numbers get large you have to convert the Ints to Longs.

Looks like you could do it in two passes:
val totalCount = statset.select(sum($"count")).collect.head.getLong(0)
statset.select(lit(totalCount) as "count", sum($"average" * $"count" / lit(totalCount)) as "average").show
Or, including the groupBy you just added:
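A single aggregation can also avoid the separate collect step. The following is only a minimal sketch, not the answer's original code: the sample rows are made up, it builds a DataFrame from tuples instead of the Stat case class, and it assumes the grouping key is the name column.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("weighted-average-sketch").getOrCreate()
import spark.implicits._

// Made-up rows, not the data from the question.
val statset = Seq(("NY", 1, 5.0), ("NY", 3, 1.0), ("NJ", 2, 4.0)).toDF("name", "count", "average")

// Weighted average = sum(average * count) / sum(count).
val weighted = (sum($"average" * $"count") / sum($"count")) as "average"

// Over the whole data set, in one aggregation:
val overall = statset.agg(sum($"count") as "count", weighted)

// Per group, assuming the grouping key is "name":
val byName = statset.groupBy($"name").agg(sum($"count") as "count", weighted)
byName.show()
```

Because sum($"count") is a Long, this form also sidesteps the Int overflow concern raised in the question.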


The proper way to compute correlation between two Seq columns into a third column

I have a DataFrame where each row has 3 columns:
ID:Long, ratings1:Seq[Double], ratings2:Seq[Double]
For each row I need to compute the correlation between those Vectors.
I came up with the following solution which seems to be inefficient (not working as Jarrod Roberson has mentioned) as I have to create RDDs for each Seq:
val similarities = ratingPairs.map(row => {
  val ratings1 = sc.parallelize(row.getAs[Seq[Double]]("ratings1"))
  val ratings2 = sc.parallelize(row.getAs[Seq[Double]]("ratings2"))
  val corr: Double = Statistics.corr(ratings1, ratings2)
  Similarity(row.getAs[Long]("ID"), corr)
})
Is there a way to compute such correlations properly?
Let's assume you have a correlation function for arrays:
def correlation(arr1: Array[Double], arr2: Array[Double]): Double
(For potential implementations of that function, which is completely independent of Spark, you can ask a separate question or search online; there are some close-enough resources, e.g. this implementation.)
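For illustration, one possible plain-Scala body for that function is sketched below. It assumes Pearson correlation is what is wanted (the default method of Statistics.corr); the body and its edge-case behaviour are my assumptions, not part of the answer.

```scala
// Pearson correlation over two equally long arrays; no Spark dependency.
def correlation(arr1: Array[Double], arr2: Array[Double]): Double = {
  require(arr1.length == arr2.length && arr1.nonEmpty, "arrays must be non-empty and of equal length")
  val n = arr1.length
  val mean1 = arr1.sum / n
  val mean2 = arr2.sum / n
  var cov = 0.0
  var var1 = 0.0
  var var2 = 0.0
  for (i <- 0 until n) {
    val d1 = arr1(i) - mean1
    val d2 = arr2(i) - mean2
    cov += d1 * d2  // co-deviation of the two series
    var1 += d1 * d1 // squared deviation of the first series
    var2 += d2 * d2 // squared deviation of the second series
  }
  // Yields NaN when either input is constant (zero variance).
  cov / math.sqrt(var1 * var2)
}
```

For example, correlation(Array(1.0, 2.0, 3.0), Array(2.0, 4.0, 6.0)) evaluates to 1.0, since the second array is an exact positive multiple of the first.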
Now, all that's left to do is to wrap this function with a UDF and use it:
import org.apache.spark.sql.functions._
import spark.implicits._
val corrUdf = udf {
  (arr1: Seq[Double], arr2: Seq[Double]) => correlation(arr1.toArray, arr2.toArray)
}
val result = df.select($"ID", corrUdf($"ratings1", $"ratings2") as "correlation")

Spark UDF called more than once per record when DF has too many columns

I'm using Spark 1.6.1 and encountering strange behaviour: I'm running a UDF with some heavy computations (a physics simulation) on a DataFrame containing some input data, and building up a result DataFrame containing many columns (~40).
Strangely, my UDF is called more than once per record of my input DataFrame in this case (1.6 times more often), which I find unacceptable because it's very expensive. If I reduce the number of columns (e.g. to 20), then this behavior disappears.
I managed to write down a small script which demonstrates this:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.udf
object Demo {
  case class Result(a: Double)
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Demo").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val numRuns = sc.accumulator(0) // to count the number of udf calls
    val myUdf = udf((i: Int) => { numRuns.add(1); Result(i.toDouble) })
    val data = sc.parallelize((1 to 100), numSlices = 5).toDF("id")
    // get results of UDF
    var results = data
      .withColumn("tmp", myUdf($"id"))
      .withColumn("result", $"tmp.a")
    // add many columns to dataframe (must depend on the UDF's result)
    for (i <- 1 to 42) {
      results = results.withColumn(s"col_$i", $"result") // column names are illustrative
    }
    // trigger action
    val res = results.collect()
    println(res.size) // prints 100
    println(numRuns.value) // prints 160
  }
}
Now, is there a way to solve this without reducing the number of columns?
I can't really explain this behavior - but obviously the query plan somehow chooses a path where some of the records are calculated twice. This means that if we cache the intermediate result (right after applying the UDF) we might be able to "force" Spark not to recompute the UDF. And indeed, once caching is added it behaves as expected - UDF is called exactly 100 times:
// get results of UDF
var results = data
.withColumn("tmp", myUdf($"id"))
.withColumn("result", $"tmp.a").cache()
Of course, caching has its own costs (memory...), but it might end up beneficial in your case if it saves many UDF calls.
We had this same problem about a year ago and spent a lot of time till we finally figured out what was the problem.
We also had a very expensive UDF to calculate, and we found out that it gets calculated again and again every time we refer to its column. It just happened to us again a few days ago, so I decided to open a bug on this:
We also opened a question here then, but now I see the title wasn't so good:
Trying to turn a blob into multiple columns in Spark
I agree with Tzach about somehow "forcing" the plan to calculate the UDF. We did it uglier, but we had to, because we couldn't cache() the data - it was too big:
val df = data.withColumn("tmp", myUdf($"id"))
val results = sqlContext.createDataFrame(df.rdd, df.schema)
.withColumn("result", $"tmp.a")
Now I see that my JIRA ticket was linked to another one, SPARK-17728, which still didn't really handle this issue the right way, but it gives one more optional workaround:
val results = data.withColumn("tmp", explode(array(myUdf($"id"))))
.withColumn("result", $"tmp.a")
In newer Spark versions (2.3+) we can mark UDFs as non-deterministic: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/expressions/UserDefinedFunction.html#asNondeterministic():org.apache.spark.sql.expressions.UserDefinedFunction
i.e. use
val myUdf = udf(...).asNondeterministic()
This makes sure the UDF is only called once.

Replace missing values with mean - Spark Dataframe

I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new to Spark, so I have been struggling to implement this logic. This is what I have managed to do so far:
a) To do this for a single column (let's say Col A), this line of code seems to work:
df.withColumn("new_Col", when($"ColA".isNull, df.select(mean("ColA"))
b) However, I have not been able to figure out how to do this for all the columns in my dataframe. I was trying out the map function, but I believe it loops through each row of a dataframe.
c) There is a similar question on SO - here. And while I liked the solution (using Aggregated tables and coalesce), I was very keen to know if there is a way to do this by looping through each column (I come from R, so looping through each column using a higher order functional like lapply seems more natural to me).
Spark >= 2.2
You can use org.apache.spark.ml.feature.Imputer (which supports both mean and median strategy).
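Scala:
import org.apache.spark.ml.feature.Imputer
val imputer = new Imputer()
  .setInputCols(df.columns)
  .setOutputCols(df.columns.map(c => s"${c}_imputed"))
  .setStrategy("mean")
imputer.fit(df).transform(df)
Python:
from pyspark.ml.feature import Imputer
imputer = Imputer(
    inputCols=df.columns,
    outputCols=["{}_imputed".format(c) for c in df.columns]
)
imputer.fit(df).transform(df)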
Spark < 2.2
Here you are:
import org.apache.spark.sql.functions.mean
df.na.fill(df.columns.zip(
  df.select(df.columns.map(mean(_)): _*).first.toSeq
).toMap)
where
df.columns.map(mean(_)): Array[Column]
computes an average for each column,
df.select(_: _*).first.toSeq: Seq[Any]
collects the aggregated values and converts the row to Seq[Any] (I know it is suboptimal, but this is the API we have to work with),
df.columns.zip(_).toMap: Map[String,Any]
creates a Map[String, Any] which maps from the column name to its average, and finally:
df.na.fill(_): DataFrame
fills the missing values using:
fill: Map[String, Any] => DataFrame
from DataFrameNaFunctions.
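To see the steps described above on concrete values, here is a small sketch. The two-column data set is made up, and the intermediate names means, meanByColumn and imputed are mine; it assumes Spark 2.x with a local session.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.mean

val spark = SparkSession.builder().master("local[*]").appName("mean-impute-sketch").getOrCreate()
import spark.implicits._

// Made-up data: None becomes a null in the DataFrame.
val df = Seq[(Option[Double], Option[Double])](
  (Some(1.0), Some(10.0)),
  (None, Some(20.0)),
  (Some(3.0), None)
).toDF("a", "b")

// One mean per column, collected to the driver as Seq[Any].
val means: Seq[Any] = df.select(df.columns.map(mean(_)): _*).first.toSeq

// Column name -> mean of that column.
val meanByColumn: Map[String, Any] = df.columns.zip(means).toMap

// Replace nulls column by column with the matching mean.
val imputed = df.na.fill(meanByColumn)
imputed.show()
```

Here the null in a is replaced by 2.0 and the null in b by 15.0, because mean skips nulls. A column whose mean comes back as null (for example an all-null or non-numeric column) would probably have to be removed from the map before calling fill.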
To ignore NaN entries you can replace:
df.select(df.columns.map(mean(_)): _*).first.toSeq
with:
import org.apache.spark.sql.functions.{col, isnan, when}
df.select(df.columns.map(
  c => mean(when(!isnan(col(c)), col(c)))
): _*).first.toSeq
For imputing the median (instead of the mean) in PySpark < 2.2:
## filter numeric cols
num_cols = [col_type[0] for col_type in filter(lambda dtype: dtype[1] in {"bigint", "double", "int"}, df.dtypes)]
### Compute a dict with <col_name, median_value>
median_dict = dict()
for c in num_cols:
    median_dict[c] = df.stat.approxQuantile(c, [0.5], 0.001)[0]
Then, apply na.fill:
df_imputed = df.na.fill(median_dict)
For PySpark, this is the code I used:
mean_dict = { col: 'mean' for col in df.columns }
col_avgs = df.agg( mean_dict ).collect()[0].asDict()
col_avgs = { k[4:-1]: v for k,v in col_avgs.items() }
df.fillna( col_avgs ).show()
The four steps are:
1) Create the dictionary mean_dict mapping column names to the aggregate operation (mean).
2) Calculate the mean for each column, and save it as the dictionary col_avgs.
3) The column names in col_avgs start with avg( and end with ), e.g. avg(col1). Strip the avg( prefix and the closing ) so that only the column name remains.
4) Fill the columns of the dataframe with the averages using col_avgs.

Spark Build Custom Column Function, user defined function

I’m using Scala and want to build my own DataFrame function. For example, I want to treat a column like an array, iterate through each element and make a calculation.
To start off, I’m trying to implement my own getMax method. So column x would have the values [3,8,2,5,9], and the expected output of the method would be 9.
Here is what it looks like in Scala
def getMax(inputArray: Array[Int]): Int = {
  var maxValue = inputArray(0)
  for (i <- 1 until inputArray.length if inputArray(i) > maxValue) {
    maxValue = inputArray(i)
  }
  maxValue
}
This is what I have so far, and get this error
"value length is not a member of org.apache.spark.sql.column",
and I don't know how else to iterate through the column.
def getMax(col: Column): Column = {
  var maxValue = col(0)
  for (i <- 1 until col.length if col(i) > maxValue) {
    maxValue = col(i)
  }
  maxValue
}
Once I am able to implement my own method, I will create a column function
val value_max: org.apache.spark.sql.Column = getMax(df.col("value")).as("value_max")
And then I hope to be able to use this in a SQL statement, for example
val sample = sqlContext.sql("SELECT value_max(x) FROM table")
and the expected output would be 9, given input column [3,8,2,5,9]
I am following an answer from another thread Spark Scala - How do I iterate rows in dataframe, and add calculated values as new columns of the data frame where they create a private method for standard deviation.
The calculations I will do will be more complex than this (e.g. I will be comparing each element in the column). Am I going in the right direction, or should I be looking more into User Defined Functions?
In a Spark DataFrame, you can't iterate through the elements of a Column using the approaches you thought of because a Column is not an iterable object.
However, to process the values of a column, you have some options and the right one depends on your task:
1) Using the existing built-in functions
Spark SQL already has plenty of useful functions for processing columns, including aggregation and transformation functions. Most of them you can find in the functions package (documentation here). Some others (binary functions in general) you can find directly in the Column object (documentation here). So, if you can use them, it's usually the best option. Note: don't forget the Window Functions.
2) Creating a UDF
If you can't complete your task with the built-in functions, you may consider defining a UDF (User Defined Function). They are useful when you can process each item of a column independently and you expect to produce a new column with the same number of rows as the original one (not an aggregated column). This approach is quite simple: first, you define a simple function, then you register it as a UDF, then you use it. Example:
def myFunc: (String => String) = { s => s.toLowerCase }
import org.apache.spark.sql.functions.udf
val myUDF = udf(myFunc)
val newDF = df.withColumn("newCol", myUDF(df("oldCol")))
For more information, here's a nice article.
3) Using a UDAF
If your task is to create aggregated data, you can define a UDAF (User Defined Aggregation Function). I don't have a lot of experience with this, but I can point you to a nice tutorial:
4) Fall back to RDD processing
If you really can't use the options above, or if your processing task depends on different rows for processing one and it's not an aggregation, then I think you would have to select the column you want and process it using the corresponding RDD. Example:
val singleColumnDF = df.select("column") // df("column") would return a Column, which has no .rdd
val myRDD = singleColumnDF.rdd
// process myRDD
So, those were the options I could think of. I hope it helps.
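Applied to the getMax example from the question, options 1 and 2 might look roughly like the sketch below. It assumes Spark 2.x; the sample data, the view name samples and the two readings of "column x" (values spread over rows versus one array per row) are my assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, udf}

val spark = SparkSession.builder().master("local[*]").appName("get-max-sketch").getOrCreate()
import spark.implicits._

// Case A (assumed): the values 3, 8, 2, 5, 9 are spread over five rows of column "x".
val perRow = Seq(3, 8, 2, 5, 9).toDF("x")
val maxOverRows = perRow.agg(max($"x") as "value_max") // built-in aggregate, no UDF needed

// Case B (assumed): a single row holds the whole array in column "x".
val perArray = Seq(Seq(3, 8, 2, 5, 9)).toDF("x")
// An array column reaches a Scala UDF as a Seq, not as an Array.
val getMax = udf((xs: Seq[Int]) => xs.max) // xs.max throws on an empty array
val maxPerArray = perArray.select(getMax($"x") as "value_max")

// Registering the same function by name makes it callable from SQL.
spark.udf.register("value_max", (xs: Seq[Int]) => xs.max)
perArray.createOrReplaceTempView("samples") // hypothetical view name
val viaSql = spark.sql("SELECT value_max(x) AS value_max FROM samples")
viaSql.show()
```

All three results contain the single value 9. The UDF as written does not guard against null or empty arrays, so real data would likely need a check before calling xs.max.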
An easy example is given in the excellent documentation, where a whole section is dedicated to UDFs:
import org.apache.spark.sql._
val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
val spark = df.sparkSession
spark.udf.register("simpleUDF", (v: Int) => v * v)
df.select($"id", callUDF("simpleUDF", $"value"))

print CoordinateMatrix after using RowMatrix.columnSimilarities in Apache Spark

I am using spark mllib for one of my projects in which I need to calculate document similarities.
I first converted the documents to vectors using tf-idf transform of the mllib, then converted it into RowMatrix and used the columnSimilarities() method.
I referred to tf-idf documentation and used the DIMSUM implementation for cosine similarities.
In spark-shell, this is the Scala code I executed:
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val documents = sc.textFile("test1").map(_.split(" ").toSeq)
val hashingTF = new HashingTF()
val tf = hashingTF.transform(documents)
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)
// now use the RowMatrix to compute cosineSimilarities
// which implements DIMSUM algorithm
val mat = new RowMatrix(tfidf)
val sim = mat.columnSimilarities() // returns a CoordinateMatrix
Now let's say my input file, test1 in this code block is a simple file with 5 short documents (less than 10 terms each), one on each row.
Since I am just testing this code, I would like to see the output of mat.columnSimilarities() which is in object sim.
I would like to see the similarity of 1st document vector with 2nd, 3rd and so on.
I referred to spark documentation for CoordinateMatrix which is the type of object returned by columnSimilarities method of RowMatrix class and referred by sim.
By going through more documentation, I figured I could convert the CoordinateMatrix to RowMatrix, then convert the rows of RowMatrix to arrays and then print like this println(sim.toRowMatrix().rows.toArray().mkString("\n")) .
But that gives some output which I couldn't understand.
Can anyone help? Any kind of resource links etc would help a lot!
You can try the following; there is no need to convert to RowMatrix format:
import org.apache.spark.mllib.linalg.distributed.MatrixEntry
val transformedRDD = sim.entries.map{case MatrixEntry(row: Long, col:Long, sim:Double) => Array(row,col,sim).mkString(",")}
To retrieve the elements you can invoke the following action
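For a small test matrix, one option is collect(), which brings every entry to the driver. The sketch below is a minimal illustration under that assumption; the 3 x 3 matrix is made up and merely stands in for tfidf.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("print-similarities-sketch").setMaster("local[*]"))

// Made-up 3 x 3 matrix standing in for tfidf.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 1.0),
  Vectors.dense(0.0, 1.0, 1.0),
  Vectors.dense(1.0, 1.0, 0.0)))
val sim = new RowMatrix(rows).columnSimilarities()

// collect() is an action; sort on the driver so the printout is ordered by (i, j).
val entries = sim.entries.collect().sortBy(e => (e.i, e.j))
entries.foreach { case MatrixEntry(i, j, value) => println(s"$i,$j,$value") }
```

Only entries with i < j are present, because columnSimilarities returns an upper-triangular matrix. Note also that it compares columns, not rows: with documents as rows, as in the question, the entries relate hashed term columns to each other, so comparing documents would presumably require transposing the matrix first.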