How to work around the immutable data frames in Spark/Scala?

I am trying to convert the following PySpark code to Scala. As you know, DataFrames in Scala are immutable, which makes it hard for me to convert the following code:
pyspark code:
time_frame = ["3m","6m","9m","12m","18m","27m","60m","60m_ab"]
variable_name = ["var1", "var2", "var3"....., "var30"]
train_df = sqlContext.sql("select * from someTable")
for var in variable_name:
    for tf in range(1,len(time_frame)):
        train_df=train_df.withColumn(str(time_frame[tf]+'_'+var), fn.col(str(time_frame[tf]+'_'+var))+fn.col(str(time_frame[tf-1]+'_'+var)))
As you can see above, the table has several columns that are used to recompute other columns. However, the immutable nature of DataFrames in Spark/Scala gets in the way. Can you help me with a workaround?

Here's one approach that first uses a for-comprehension to generate a list of tuples consisting of column name pairs, and then traverses the list using foldLeft to iteratively transform trainDF via withColumn:
import org.apache.spark.sql.functions._
val timeframes: Seq[String] = ???
val variableNames: Seq[String] = ???
val newCols = for {
  vn <- variableNames
  tf <- 1 until timeframes.size
} yield (timeframes(tf) + "_" + vn, timeframes(tf - 1) + "_" + vn)
val trainDF = spark.sql("""select * from some_table""")
val resultDF = newCols.foldLeft(trainDF)( (accDF, cs) =>
  accDF.withColumn(cs._1, col(cs._1) + col(cs._2))
)
To test the above code, simply provide sample input and create table some_table:
val timeframes = Seq("3m", "6m", "9m")
val variableNames = Seq("var1", "var2")
val df = Seq(
(1, 10, 11, 12, 13, 14, 15),
(2, 20, 21, 22, 23, 24, 25),
(3, 30, 31, 32, 33, 34, 35)
).toDF("id", "3m_var1", "6m_var1", "9m_var1", "3m_var2", "6m_var2", "9m_var2")
df.createOrReplaceTempView("some_table")
resultDF should look like the following:
resultDF.show
// +---+-------+-------+-------+-------+-------+-------+
// | id|3m_var1|6m_var1|9m_var1|3m_var2|6m_var2|9m_var2|
// +---+-------+-------+-------+-------+-------+-------+
// | 1| 10| 21| 33| 13| 27| 42|
// | 2| 20| 41| 63| 23| 47| 72|
// | 3| 30| 61| 93| 33| 67| 102|
// +---+-------+-------+-------+-------+-------+-------+
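To see why the result is cumulative (9m_var1 is 33, not 23), here is a minimal sketch of the same fold in plain Scala, without Spark. It models one row of the sample table as an immutable Map; the object name FoldLeftSketch and the Map-based row are assumptions made for illustration only. Each step returns a new Map, just as withColumn returns a new DataFrame, so the 9m step already sees the updated 6m value.

```scala
object FoldLeftSketch {
  val timeframes = Seq("3m", "6m", "9m")
  val variableNames = Seq("var1", "var2")

  // (current column, previous column) pairs, in the same order as newCols
  val pairs: Seq[(String, String)] = for {
    vn <- variableNames
    tf <- 1 until timeframes.size
  } yield (timeframes(tf) + "_" + vn, timeframes(tf - 1) + "_" + vn)

  // First row of the sample table, modelled as an immutable Map (illustrative stand-in for a DataFrame row)
  val row: Map[String, Int] = Map(
    "3m_var1" -> 10, "6m_var1" -> 11, "9m_var1" -> 12,
    "3m_var2" -> 13, "6m_var2" -> 14, "9m_var2" -> 15)

  // Each step builds a new Map from the accumulator; nothing is mutated
  val result: Map[String, Int] = pairs.foldLeft(row) { case (acc, (cur, prev)) =>
    acc.updated(cur, acc(cur) + acc(prev))
  }
}
```

FoldLeftSketch.result holds the values of the first output row: 10, 21, 33 for var1 and 13, 27, 42 for var2.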

Related

How to write a function that takes a list of column names of a DataFrame, reorders the selected columns to the left and preserves the unselected columns

I'd like to build a function
def reorderColumns(columnNames: List[String]) = ...
that can be applied to a Spark DataFrame such that the columns specified in columnNames get reordered to the left, and the remaining columns (in any order) stay on the right.
Example:
Given a df with the following 5 columns
| A | B | C | D | E
df.reorderColumns(["D","B","A"]) returns a df with columns ordered like so:
| D | B | A | C | E
Try this one:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
def reorderColumns(df: DataFrame, columns: Array[String]): DataFrame = {
val restColumns: Array[String] = df.columns.filterNot(c => columns.contains(c))
df.select((columns ++ restColumns).map(col): _*)
}
Usage example:
val spark: SparkSession = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import spark.implicits._
val df = List((1, 3, 1, 6), (2, 4, 2, 5), (3, 6, 3, 4)).toDF("colA", "colB", "colC", "colD")
reorderColumns(df, Array("colC", "colB")).show
// output:
//+----+----+----+----+
//|colC|colB|colA|colD|
//+----+----+----+----+
//| 1| 3| 1| 6|
//| 2| 4| 2| 5|
//| 3| 6| 3| 4|
//+----+----+----+----+
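The ordering logic itself does not depend on Spark, so it can be checked on plain column names. The sketch below is only an illustration: the object ReorderSketch, the function reorderNames and the check for unknown names are assumptions, not part of the answer above.

```scala
object ReorderSketch {
  // Put `front` first (in the given order), then every remaining name in its original order.
  def reorderNames(all: Seq[String], front: Seq[String]): Seq[String] = {
    // Fail early on names that are not columns (an added safety check, not in the original function)
    val missing = front.filterNot(all.contains)
    require(missing.isEmpty, s"Unknown columns: ${missing.mkString(", ")}")
    front ++ all.filterNot(front.contains)
  }
}
```

With the columns A, B, C, D, E from the question, ReorderSketch.reorderNames(Seq("A", "B", "C", "D", "E"), Seq("D", "B", "A")) gives D, B, A, C, E. Passing the result to df.select would then reorder the DataFrame.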

Looking to subtract a value from every row based on the value of a separate DF

As the title states, I would like to subtract the mean of a specific column from each value in that column.
Here is my code attempt:
val test = moviePairs.agg(avg(col("rating1")).alias("avgX"), avg(col("rating2")).alias("avgY"))
val subMean = moviePairs.withColumn("meanDeltaX", col("rating1") - test.select("avgX").collect())
.withColumn("meanDeltaY", col("rating2") - test.select("avgY").collect())
subMean.show()
You can either use Spark's DataFrame functions or a mere SQL query to a DataFrame to aggregate the values of the means for the columns you are focusing on (rating1, rating2).
val moviePairs = spark.createDataFrame(
Seq(
("Moonlight", 7, 8),
("Lord Of The Drinks", 10, 1),
("The Disaster Artist", 3, 5),
("Airplane!", 7, 9),
("2001", 5, 1),
)
).toDF("movie", "rating1", "rating2")
// find the means for each column and isolate the first (and only) row to get their values
val means = moviePairs.agg(avg("rating1"), avg("rating2")).head()
// alternatively, by using a simple SQL query:
// moviePairs.createOrReplaceTempView("movies")
// val means = spark.sql("select AVG(rating1), AVG(rating2) from movies").head()
val subMean = moviePairs.withColumn("meanDeltaX", col("rating1") - means.getDouble(0))
.withColumn("meanDeltaY", col("rating2") - means.getDouble(1))
subMean.show()
Output for the test input DataFrame moviePairs (note the usual double-precision rounding error, which you can manage by rounding the result):
+-------------------+-------+-------+-------------------+-------------------+
| movie|rating1|rating2| meanDeltaX| meanDeltaY|
+-------------------+-------+-------+-------------------+-------------------+
| Moonlight| 7| 8| 0.5999999999999996| 3.2|
| Lord Of The Drinks| 10| 1| 3.5999999999999996| -3.8|
|The Disaster Artist| 3| 5|-3.4000000000000004|0.20000000000000018|
| Airplane!| 7| 9| 0.5999999999999996| 4.2|
| 2001| 5| 1|-1.4000000000000004| -3.8|
+-------------------+-------+-------+-------------------+-------------------+
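The rounding noise in meanDeltaX can be reproduced and tidied up in plain Scala. This is a minimal sketch under stated assumptions: the object MeanCentering and the helper round2 are illustrative names, and rounding to two decimals is an arbitrary choice. In Spark itself, wrapping the expression in org.apache.spark.sql.functions.round should have a similar effect.

```scala
object MeanCentering {
  // rating1 values from the sample moviePairs data
  val rating1: Seq[Int] = Seq(7, 10, 3, 7, 5)

  // Mean as a Double, then the per-value deltas (these carry floating-point noise)
  val mean: Double = rating1.sum.toDouble / rating1.size
  val deltas: Seq[Double] = rating1.map(_ - mean)

  // Round to two decimals to hide the noise (hypothetical helper)
  def round2(x: Double): Double =
    BigDecimal(x).setScale(2, BigDecimal.RoundingMode.HALF_UP).toDouble

  val rounded: Seq[Double] = deltas.map(round2)
}
```

Here MeanCentering.rounded is 0.6, 3.6, -3.4, 0.6, -1.4, matching the meanDeltaX column above once the noise is removed.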

Evaluate formulas in Spark DataFrame

Is it possible to evaluate formulas in a dataframe which refer to columns? e.g. if I have data like this (Scala example):
val df = Seq(
( 1, "(a+b)/d", 1, 20, 2, 3, 1 ),
( 2, "(c+b)*(a+e)", 0, 1, 2, 3, 4 ),
( 3, "a*(d+e+c)", 7, 10, 6, 2, 1 )
)
.toDF( "Id", "formula", "a", "b", "c", "d", "e" )
df.show()
Expected results:
I have been unable to get selectExpr, expr, eval() or combinations of them to work.
You can use the scala toolbox eval in a UDF:
import org.apache.spark.sql.functions.{array, col, udf}
import scala.reflect.runtime.universe
import scala.tools.reflect.ToolBox
val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()
val cols = df.columns.tail
val eval_udf = udf(
(r: Seq[String]) =>
tb.eval(tb.parse(
("val %s = %s;" * cols.tail.size).format(
cols.tail.zip(r.tail).flatMap(x => List(x._1, x._2)): _*
) + r(0)
)).toString
)
val df2 = df.select(col("id"), eval_udf(array(df.columns.tail.map(col):_*)).as("result"))
df2.show
+---+------+
| id|result|
+---+------+
| 1| 7|
| 2| 12|
| 3| 63|
+---+------+
A slightly different version of mck's answer: replace the variables in the formula column with their corresponding values from the other columns, then call an eval UDF:
import org.apache.spark.sql.functions.{expr, udf}
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox
val eval = udf((f: String) => {
val toolbox = currentMirror.mkToolBox()
toolbox.eval(toolbox.parse(f)).toString
})
val formulaExpr = expr(df.columns.drop(2).foldLeft("formula")((acc, c) => s"replace($acc, '$c', $c)"))
df.select($"Id", eval(formulaExpr).as("result")).show()
//+---+------+
//| Id|result|
//+---+------+
//| 1| 7|
//| 2| 12|
//| 3| 63|
//+---+------+
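The substitution step can be tried on its own, without Spark or the toolbox. The sketch below is an assumption-laden illustration: the object FormulaSubstitution and the function substitute are hypothetical names, and it matches whole words only, because a plain substring replace would misfire if one column name were contained in another (for example a and ab).

```scala
import java.util.regex.Pattern

object FormulaSubstitution {
  // Replace each variable name in the formula with its value, matching whole words only.
  def substitute(formula: String, values: Map[String, Int]): String =
    values.foldLeft(formula) { case (acc, (name, v)) =>
      acc.replaceAll("\\b" + Pattern.quote(name) + "\\b", v.toString)
    }
}
```

For the first sample row, FormulaSubstitution.substitute("(a+b)/d", Map("a" -> 1, "b" -> 20, "c" -> 2, "d" -> 3, "e" -> 1)) gives "(1+20)/3", which is the string the eval UDF would then evaluate.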

Scala add new column to dataframe by expression

I want to add a new column to a DataFrame using an expression.
For example, I have the following DataFrame:
+-----+----------+----------+-----+
| C1 | C2 | C3 |C4 |
+-----+----------+----------+-----+
|steak|1 |1 | 150|
|steak|2 |2 | 180|
| fish|3 |3 | 100|
+-----+----------+----------+-----+
and I want to create a new column C5 with the expression "C2/C3+C4". Assume that several new columns need to be added, and that the expressions may differ and come from a database.
Is there a good way to do this?
I know that if I have an expression like "2+3*4" I can use scala.tools.reflect.ToolBox to evaluate it.
Normally I use df.withColumn to add a new column.
It seems I need to create a UDF, but how can I pass the column values as parameters to the UDF, especially when multiple expressions may need different columns for their calculation?
This can be done using expr to create a Column from an expression:
val df = Seq((1,2)).toDF("x","y")
val myExpression = "x+y"
import org.apache.spark.sql.functions.expr
df.withColumn("z",expr(myExpression)).show()
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 3|
+---+---+---+
Two approaches:
import spark.implicits._ //so that you could use .toDF
val df = Seq(
("steak", 1, 1, 150),
("steak", 2, 2, 180),
("fish", 3, 3, 100)
).toDF("C1", "C2", "C3", "C4")
import org.apache.spark.sql.functions._
// 1st approach using expr
df.withColumn("C5", expr("C2/(C3 + C4)")).show()
// 2nd approach using selectExpr
df.selectExpr("*", "(C2/(C3 + C4)) as C5").show()
+-----+---+---+---+--------------------+
| C1| C2| C3| C4| C5|
+-----+---+---+---+--------------------+
|steak| 1| 1|150|0.006622516556291391|
|steak| 2| 2|180| 0.01098901098901099|
| fish| 3| 3|100| 0.02912621359223301|
+-----+---+---+---+--------------------+
In Spark 2.x, you can create a new column C5 with the expression "C2/C3+C4" using withColumn() and org.apache.spark.sql.functions._:
val currentDf = Seq(
("steak", 1, 1, 150),
("steak", 2, 2, 180),
("fish", 3, 3, 100)
).toDF("C1", "C2", "C3", "C4")
val requiredDf = currentDf
.withColumn("C5", (col("C2")/col("C3")+col("C4")))
You can also build the same expression with org.apache.spark.sql.Column directly.
(functions.col(name) simply constructs a Column, so the two forms are equivalent; col is just shorter to write.)
val requiredDf = currentDf
.withColumn("C5", (new Column("C2")/new Column("C3")+new Column("C4")))
This worked perfectly for me. I am using Spark 2.0.2.

Spark, Scala, DataFrame: create feature vectors

I have a DataFrame that looks as follows:
userID, category, frequency
1,cat1,1
1,cat2,3
1,cat9,5
2,cat4,6
2,cat9,2
2,cat10,1
3,cat1,5
3,cat7,16
3,cat8,2
The number of distinct categories is 10, and I would like to create a feature vector for each userID and fill the missing categories with zeros.
So the output would be something like:
userID,feature
1,[1,3,0,0,0,0,0,0,5,0]
2,[0,0,0,6,0,0,0,0,2,1]
3,[5,0,0,0,0,0,16,2,0,0]
This is just an illustrative example; in reality I have about 200,000 unique userIDs and 300 unique categories.
What is the most efficient way to create the features DataFrame?
A little bit more DataFrame centric solution:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.{lit, sum, when}
val df = sc.parallelize(Seq(
(1, "cat1", 1), (1, "cat2", 3), (1, "cat9", 5), (2, "cat4", 6),
(2, "cat9", 2), (2, "cat10", 1), (3, "cat1", 5), (3, "cat7", 16),
(3, "cat8", 2))).toDF("userID", "category", "frequency")
// Create a sorted array of categories
val categories = df
.select($"category")
.distinct.map(_.getString(0))
.collect
.sorted
// Prepare vector assemble
val assembler = new VectorAssembler()
.setInputCols(categories)
.setOutputCol("features")
// Aggregation expressions
val exprs = categories.map(
c => sum(when($"category" === c, $"frequency").otherwise(lit(0))).alias(c))
val transformed = assembler.transform(
df.groupBy($"userID").agg(exprs.head, exprs.tail: _*))
.select($"userID", $"features")
and a UDAF alternative:
import org.apache.spark.sql.expressions.{
MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.mllib.linalg.{Vectors, VectorUDT}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{
StructType, ArrayType, DoubleType, IntegerType}
import scala.collection.mutable.WrappedArray
class VectorAggregate (n: Int) extends UserDefinedAggregateFunction {
def inputSchema = new StructType()
.add("i", IntegerType)
.add("v", DoubleType)
def bufferSchema = new StructType().add("buff", ArrayType(DoubleType))
def dataType = new VectorUDT()
def deterministic = true
def initialize(buffer: MutableAggregationBuffer) = {
buffer.update(0, Array.fill(n)(0.0))
}
def update(buffer: MutableAggregationBuffer, input: Row) = {
if (!input.isNullAt(0)) {
val i = input.getInt(0)
val v = input.getDouble(1)
val buff = buffer.getAs[WrappedArray[Double]](0)
buff(i) += v
buffer.update(0, buff)
}
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
val buff1 = buffer1.getAs[WrappedArray[Double]](0)
val buff2 = buffer2.getAs[WrappedArray[Double]](0)
for ((x, i) <- buff2.zipWithIndex) {
buff1(i) += x
}
buffer1.update(0, buff1)
}
def evaluate(buffer: Row) = Vectors.dense(
buffer.getAs[Seq[Double]](0).toArray)
}
with example usage:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
.setInputCol("category")
.setOutputCol("category_idx")
.fit(df)
val indexed = indexer.transform(df)
.withColumn("category_idx", $"category_idx".cast("integer"))
.withColumn("frequency", $"frequency".cast("double"))
val n = indexer.labels.size + 1
val transformed = indexed
.groupBy($"userID")
.agg(new VectorAggregate(n)($"category_idx", $"frequency").as("vec"))
transformed.show
// +------+--------------------+
// |userID| vec|
// +------+--------------------+
// | 1|[1.0,5.0,0.0,3.0,...|
// | 2|[0.0,2.0,0.0,0.0,...|
// | 3|[5.0,0.0,16.0,0.0...|
// +------+--------------------+
In this case order of values is defined by indexer.labels:
indexer.labels
// Array[String] = Array(cat1, cat9, cat7, cat2, cat8, cat4, cat10)
In practice I would prefer the solution by Odomontois, so these are provided mostly for reference.
Suppose:
val cs: SparkContext
val sc: SQLContext
val cats: DataFrame
Here userId and frequency are bigint columns, which correspond to scala.Long.
We first create an intermediate mapping RDD:
val catMaps = cats.rdd
.groupBy(_.getAs[Long]("userId"))
.map { case (id, rows) => id -> rows
.map { row => row.getAs[String]("category") -> row.getAs[Long]("frequency") }
.toMap
}
Then we collect all categories that are present, in lexicographic order:
val catNames = cs.broadcast(catMaps.map(_._2.keySet).reduce(_ union _).toArray.sorted)
Or create the list manually:
val catNames = cs.broadcast((1 to 10).map(n => s"cat$n").toArray)
Finally, we transform the maps to arrays, with 0 for missing categories:
import sc.implicits._
val catArrays = catMaps
.map { case (id, catMap) => id -> catNames.value.map(catMap.getOrElse(_, 0L)) }
.toDF("userId", "feature")
Now catArrays.show() prints something like:
+------+--------------------+
|userId| feature|
+------+--------------------+
| 2|[0, 1, 0, 6, 0, 0...|
| 1|[1, 0, 3, 0, 0, 0...|
| 3|[5, 0, 0, 0, 16, ...|
+------+--------------------+
This might not be the most elegant solution for DataFrames, as I am barely familiar with this area of Spark.
Note that you could create catNames manually to add zeros for the missing cat3, cat5, ...
Also note that the catMaps RDD is otherwise evaluated twice, so you might want to .persist() it.
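The zero-filling logic can be checked on plain Scala collections before running it on an RDD. This is only a sketch: the object DenseFeatures is a hypothetical name, and it assumes the category list is fixed to cat1 through cat10 (the manual variant), so that missing categories still get a slot.

```scala
object DenseFeatures {
  // (userID, category, frequency) triples from the question
  val rows: Seq[(Int, String, Long)] = Seq(
    (1, "cat1", 1L), (1, "cat2", 3L), (1, "cat9", 5L),
    (2, "cat4", 6L), (2, "cat9", 2L), (2, "cat10", 1L),
    (3, "cat1", 5L), (3, "cat7", 16L), (3, "cat8", 2L))

  // Fixed category order cat1..cat10 (assumed), so absent categories become zeros
  val catNames: Seq[String] = (1 to 10).map(n => s"cat$n")

  // One dense array per user: look up each category, defaulting to 0
  val features: Map[Int, Seq[Long]] =
    rows.groupBy(_._1).map { case (id, rs) =>
      val freq = rs.map { case (_, cat, f) => cat -> f }.toMap
      id -> catNames.map(freq.getOrElse(_, 0L))
    }
}
```

DenseFeatures.features(1) is 1, 3, 0, 0, 0, 0, 0, 0, 5, 0, which is the first row of the output requested in the question.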
Given your input:
val df = Seq((1, "cat1", 1), (1, "cat2", 3), (1, "cat9", 5),
(2, "cat4", 6), (2, "cat9", 2), (2, "cat10", 1),
(3, "cat1", 5), (3, "cat7", 16), (3, "cat8", 2))
.toDF("userID", "category", "frequency")
df.show
+------+--------+---------+
|userID|category|frequency|
+------+--------+---------+
| 1| cat1| 1|
| 1| cat2| 3|
| 1| cat9| 5|
| 2| cat4| 6|
| 2| cat9| 2|
| 2| cat10| 1|
| 3| cat1| 5|
| 3| cat7| 16|
| 3| cat8| 2|
+------+--------+---------+
Just run:
val pivoted = df.groupBy("userID").pivot("category").avg("frequency")
val dfZeros = pivoted.na.fill(0)
dfZeros.show
+------+----+-----+----+----+----+----+----+
|userID|cat1|cat10|cat2|cat4|cat7|cat8|cat9|
+------+----+-----+----+----+----+----+----+
| 1| 1.0| 0.0| 3.0| 0.0| 0.0| 0.0| 5.0|
| 3| 5.0| 0.0| 0.0| 0.0|16.0| 2.0| 0.0|
| 2| 0.0| 1.0| 0.0| 6.0| 0.0| 0.0| 2.0|
+------+----+-----+----+----+----+----+----+
Finally, use VectorAssembler to create an org.apache.spark.ml.linalg.Vector.
NOTE: I have not checked the performance of this yet...
EDIT: Possibly more complex, but likely more efficient!
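import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{collect_list, sort_array, struct, udf}
def toSparseVectorUdf(size: Int) = udf[Vector, Seq[Row]] {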
(data: Seq[Row]) => {
val indices = data.map(_.getDouble(0).toInt).toArray
val values = data.map(_.getInt(1).toDouble).toArray
Vectors.sparse(size, indices, values)
}
}
val indexer = new StringIndexer().setInputCol("category").setOutputCol("idx")
val indexerModel = indexer.fit(df)
val totalCategories = indexerModel.labels.size
val dataWithIndices = indexerModel.transform(df)
val data = dataWithIndices.groupBy("userId").agg(sort_array(collect_list(struct($"idx", $"frequency".as("val")))).as("data"))
val dataWithFeatures = data.withColumn("features", toSparseVectorUdf(totalCategories)($"data")).drop("data")
dataWithFeatures.show(false)
+------+--------------------------+
|userId|features |
+------+--------------------------+
|1 |(7,[0,1,3],[1.0,5.0,3.0]) |
|3 |(7,[0,2,4],[5.0,16.0,2.0])|
|2 |(7,[1,5,6],[2.0,6.0,1.0]) |
+------+--------------------------+
NOTE: StringIndexer sorts categories by frequency, so the most frequent category is at index 0 in indexerModel.labels. Feel free to use your own mapping if you'd like and pass that directly to toSparseVectorUdf.
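To make the sparse output easier to read, here is a plain-Scala sketch of how each (size, indices, values) triple comes about. It is an illustration under assumptions: the object SparseFeatures is a hypothetical name, and the label order is copied from the indexer.labels output shown earlier rather than computed, since the order of equally frequent categories may vary between runs or Spark versions.

```scala
object SparseFeatures {
  // Label order copied from the indexer.labels output shown earlier (assumed, not computed)
  val labels: Seq[String] = Seq("cat1", "cat9", "cat7", "cat2", "cat8", "cat4", "cat10")
  val index: Map[String, Int] = labels.zipWithIndex.toMap

  // (userID, category, frequency) triples from the question
  val rows: Seq[(Int, String, Int)] = Seq(
    (1, "cat1", 1), (1, "cat2", 3), (1, "cat9", 5),
    (2, "cat4", 6), (2, "cat9", 2), (2, "cat10", 1),
    (3, "cat1", 5), (3, "cat7", 16), (3, "cat8", 2))

  // Per user: (vector size, ascending indices, matching values)
  val sparse: Map[Int, (Int, Seq[Int], Seq[Double])] =
    rows.groupBy(_._1).map { case (id, rs) =>
      val sorted = rs.map { case (_, cat, f) => index(cat) -> f.toDouble }.sortBy(_._1)
      id -> ((labels.size, sorted.map(_._1), sorted.map(_._2)))
    }
}
```

SparseFeatures.sparse(1) is (7, [0, 1, 3], [1.0, 5.0, 3.0]), the same triple as in the first row of the output above. The indices must be in ascending order, which is why the Spark version applies sort_array before building the vector.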