Split Spark dataframe and calculate average based on one column value - scala

I have two dataframes. The first dataframe, classRecord, has 10 different entries like the following:
Class, Calculation
first, Average
Second, Sum
Third, Average
The second dataframe, studentRecord, has around 50K entries like the following:
Name, height, Camp, Class
Shae, 152, yellow, first
Joe, 140, yellow, first
Mike, 149, white, first
Anne, 142, red, first
Tim, 154, red, Second
Jake, 153, white, Second
Sherley, 153, white, Second
From the second dataframe, based on the class type, I would like to perform a calculation on height (for class first: average, for class Second: sum, etc.), grouped by camp separately (e.g. if the class is first, the average of yellow, white and so on separately).
I tried the following code:
//function to calculate average
def averageOnName(splitFrame: org.apache.spark.sql.DataFrame): Array[(String, Double)] = {
  val pairedRDD: RDD[(String, Double)] = splitFrame.select($"Name", $"height".cast("double")).as[(String, Double)].rdd
  var avg_by_key = pairedRDD.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).mapValues(y => 1.0 * y._1 / y._2).collect
  return avg_by_key
}
//required schema for further modifications
val schema = StructType(
  StructField("name", StringType, false) ::
  StructField("avg", DoubleType, false) :: Nil)
// for each loop on each class type
classRecord.rdd.foreach{
  //filter students based on camps
  var campYellow = studentRecord.filter($"Camp" === "yellow")
  var campWhite = studentRecord.filter($"Camp" === "white")
  var campRed = studentRecord.filter($"Camp" === "red")
  // since I know that calculation for first class is average, so representing calculation only for class first
  val avgcampYellow = averageOnName(campYellow)
  val avgcampWhite = averageOnName(campWhite)
  val avgcampRed = averageOnName(campRed)
  // union of all
  val rddYellow = sc.parallelize(avgcampYellow).map(x => org.apache.spark.sql.Row(x._1, x._2.asInstanceOf[Number].doubleValue()))
  //conversion of rdd to frame
  var dfYellow = sqlContext.createDataFrame(rddYellow, schema)
  //union with yellow camp data
  val rddWhite = sc.parallelize(avgcampWhite).map(x => org.apache.spark.sql.Row(x._1, x._2.asInstanceOf[Number].doubleValue()))
  //conversion of rdd to frame
  var dfWhite = sqlContext.createDataFrame(rddWhite, schema)
  var dfYellWhite = dfYellow.union(dfWhite)
  //union with yellow,white camp data
  val rddRed = sc.parallelize(avgcampRed).map(x => org.apache.spark.sql.Row(x._1, x._2.asInstanceOf[Number].doubleValue()))
  //conversion of rdd to frame
  var dfRed = sqlContext.createDataFrame(rddRed, schema)
  var dfYellWhiteRed = dfYellWhite.union(dfRed)
  // other modifications and final result to hive
}
Here I am struggling with:
Hardcoding yellow, red and white; there may be additional camp types as well.
The dataframe is currently being filtered many times, which could be improved.
I'm not able to figure out how to calculate differently according to the class calculation type (i.e. use sum/average depending on the class type).
Any help is appreciated.

You could simply do the average and sum calculations for all combinations of Class/Camp and then parse the classRecord dataframe separately and extract what you need. You can do this easily in Spark by using the groupBy() method and aggregating the values.
Using your example dataframe:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
studentRecord.show()
+-------+------+------+------+
| Name|height| Camp| Class|
+-------+------+------+------+
| Shae| 152|yellow| first|
| Joe| 140|yellow| first|
| Mike| 149| white| first|
| Anne| 142| red| first|
| Tim| 154| red|Second|
| Jake| 153| white|Second|
|Sherley| 153| white|Second|
+-------+------+------+------+
val df = studentRecord.groupBy("Class", "Camp")
  .agg(
    sum($"height").as("Sum"),
    avg($"height").as("Average"),
    collect_list($"Name").as("Names")
  )
df.show()
+------+------+---+-------+---------------+
| Class| Camp|Sum|Average| Names|
+------+------+---+-------+---------------+
| first| white|149| 149.0| [Mike]|
| first| red|142| 142.0| [Anne]|
|Second| red|154| 154.0| [Tim]|
|Second| white|306| 153.0|[Jake, Sherley]|
| first|yellow|292| 146.0| [Shae, Joe]|
+------+------+---+-------+---------------+
After doing this, you can simply check your first dataframe, classRecord, for which rows you need. Here is an example of what it can look like; it can be changed to suit your actual needs:
import org.apache.spark.sql.Row

// Collects the classRecord dataframe as an Array[(String, String)]
val classRecs = classRecord.collect().map{ case Row(clas: String, calc: String) => (clas, calc) }
for (classRec <- classRecs) {
  val clas = classRec._1
  val calc = classRec._2
  // Matches which calculation you want to do
  val df2 = calc match {
    case "Average" => df.filter($"Class" === clas).select("Class", "Camp", "Average")
    case "Sum"     => df.filter($"Class" === clas).select("Class", "Camp", "Sum")
  }
  // Do something with df2
}
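If you'd rather end up with a single dataframe instead of looping over the collected rows, one alternative (a sketch of mine, not part of the original answer) is to join the aggregated df back to classRecord and pick the requested value with when; this assumes classRecord has exactly the Class and Calculation columns shown in the question:
import org.apache.spark.sql.functions.when

// Join the per-Class/Camp aggregates with the calculation type and keep the matching value.
val combined = df.join(classRecord, Seq("Class"))
  .withColumn("Result",
    when($"Calculation" === "Sum", $"Sum")
      .when($"Calculation" === "Average", $"Average"))
  .select("Class", "Camp", "Calculation", "Result")
combined.show()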
Hope it helps!

Related

Spark 2.3: subtract dataframes but preserve duplicate values (Scala)

Copying example from this question:
As a conceptual example, if I have two dataframes:
words = [the, quick, fox, a, brown, fox]
stopWords = [the, a]
then I want the output to be, in any order:
words - stopWords = [quick, brown, fox, fox]
exceptAll can do this in 2.4 but I cannot upgrade. The answer in the linked question is specific to that dataframe:
words.join(stopwords, words("id") === stopwords("id"), "left_outer")
.where(stopwords("id").isNull)
.select(words("id")).show()
as in, you need to know the pkey and the other columns.
Can anyone come up with an answer that will work on any dataframe?
Here is an implementation for you all. I have tested it in Spark 2.4.2; it should work for 2.3 too (not 100% sure).
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import spark.implicits._

def exceptAllCustom(df1: DataFrame, df2: DataFrame, pks: Seq[String]): DataFrame = {
  val notNullCondition = pks.foldLeft(lit(0 == 0))((column, cName) => column && df2(cName).isNull)
  val joinCondition = pks.foldLeft(lit(0 == 0))((column, cName) => column && df2(cName) === df1(cName))
  val result = df1.join(df2, joinCondition, "left_outer")
    .where(notNullCondition)
  pks.foldLeft(result)((df, cName) => df.drop(df2(cName)))
}

val df1 = spark.createDataset(Seq("the", "quick", "fox", "a", "brown", "fox")).toDF("c1")
val df2 = spark.createDataset(Seq("the", "a")).toDF("c1")
exceptAllCustom(df1, df2, Seq("c1")).show()
Result -
+-----+
| c1|
+-----+
|quick|
| fox|
|brown|
| fox|
+-----+
Turns out it's easier to do df1.except(df2) and then join the results with df1 to get all the duplicates.
Full code:
import org.apache.spark.sql.{Column, DataFrame}

def exceptAllCustom(df1: DataFrame, df2: DataFrame): DataFrame = {
  val except = df1.except(df2)
  val columns = df1.columns
  val colExpr: Column = df1(columns.head) <=> except(columns.head)
  val joinExpression = columns.tail.foldLeft(colExpr) { (colExpr, p) =>
    colExpr && df1(p) <=> except(p)
  }
  val join = df1.join(except, joinExpression, "inner")
  join.select(df1("*"))
}
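For completeness, a quick usage sketch with the example data from the question (building the inputs the same way as the first answer; the output should contain quick, brown, fox, fox in some order):
import spark.implicits._

val words = spark.createDataset(Seq("the", "quick", "fox", "a", "brown", "fox")).toDF("c1")
val stopWords = spark.createDataset(Seq("the", "a")).toDF("c1")

// Expected rows, in some order: quick, brown, fox, fox
exceptAllCustom(words, stopWords).show()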

How to extract efficiently multiple columns from a single string column RDD?

I have a file with 20+ columns of which I would like to extract a few. Until now I have the following code. I'm sure there is a smarter way to do it, but I haven't been able to get it working. Any ideas?
mvnmdata is of type RDD[String]
val strpcols = mvnmdata.map(x => x.split('|')).map(x => (x(0),x(1),x(5),x(6),x(7),x(8),x(9),x(10),x(11),x(12),x(13),x(14),x(15),x(16),x(17),x(18),x(19),x(20),x(21),x(22),x(23) ))
The following solution provides an easy and scalable way to manage your column names and indices. It is based on a map which determines the column name/index relation; the map will also help us handle both the index of each extracted column and its name.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructType, StructField}

val rdd = spark.sparkContext.parallelize(Seq(
  "1|500|400|300",
  "1|34|67|89",
  "2|10|20|56",
  "3|2|5|56",
  "3|1|8|22"))

val dictColums = Map("c0" -> 0, "c2" -> 2)

// create schema from map keys
val schema = StructType(dictColums.keys.toSeq.map(StructField(_, StringType, true)))

val mappedRDD = rdd.map{ line => line.split('|') }
  .map{ cols => Row.fromSeq(dictColums.values.toSeq.map{ cols(_) }) }

val df = spark.createDataFrame(mappedRDD, schema)
df.show
//output
+---+---+
| c0| c2|
+---+---+
| 1|400|
| 1| 67|
| 2| 20|
| 3| 5|
| 3| 8|
+---+---+
First we declare dictColums; in this example we will extract the columns "c0" -> 0 and "c2" -> 2.
Next we create the schema from the keys of the map.
The first map (which you already have) splits each line by |; the second one creates a Row containing the values that correspond to each item of dictColums.values.
UPDATE:
You could also wrap the above functionality in a function so that it can be reused multiple times:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

def stringRddToDataFrame(colsMapping: Map[String, Int], rdd: RDD[String]): DataFrame = {
  val schema = StructType(colsMapping.keys.toSeq.map(StructField(_, StringType, true)))
  val mappedRDD = rdd.map{ line => line.split('|') }
    .map{ cols => Row.fromSeq(colsMapping.values.toSeq.map{ cols(_) }) }
  spark.createDataFrame(mappedRDD, schema)
}
And then use it for your case:
val cols = Map("c0" -> 0, "c1" -> 1, "c5" -> 5, ... "c23" -> 23)
val df = stringRddToDataFrame(cols, rdd)
If you don't want to write out repeated x(i) calls, you can process the indices in a loop, as below. Example 1:
import scala.collection.mutable.ArrayBuffer

val strpcols = mvnmdata.map(x => x.split('|'))
  .map(x => {
    val xbuffer = new ArrayBuffer[String]()
    for (i <- Array(0,1,5,6...)){
      xbuffer.append(x(i))
    }
    xbuffer
  })
If you only want to define the index list with a start and end plus the numbers to be excluded, see Example 2 below:
scala> (1 to 10).toSet
res8: scala.collection.immutable.Set[Int] = Set(5, 10, 1, 6, 9, 2, 7, 3, 8, 4)
scala> ((1 to 10).toSet -- Set(2,9)).toArray.sortBy(row=>row)
res9: Array[Int] = Array(1, 3, 4, 5, 6, 7, 8, 10)
The final code you want:
//define the function to process indexes
def getSpecIndexes(start: Int, end: Int, removedValueSet: Set[Int]): Array[Int] = {
  ((start to end).toSet -- removedValueSet).toArray.sortBy(row => row)
}

val strpcols = mvnmdata.map(x => x.split('|'))
  .map(x => {
    val xbuffer = new ArrayBuffer[String]()
    //call the function
    for (i <- getSpecIndexes(0,100,Set(3,4,5,6))){
      xbuffer.append(x(i))
    }
    xbuffer
  })
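A slightly more compact variant of the same idea (my sketch, not part of the original answer) filters by index with zipWithIndex, which also avoids indexing past the actual number of fields in a line:
// Keep only the wanted indices; zipWithIndex pairs each field with its position.
val wanted = getSpecIndexes(0, 100, Set(3, 4, 5, 6)).toSet
val strpcols = mvnmdata.map(_.split('|'))
  .map(_.zipWithIndex.collect { case (value, idx) if wanted(idx) => value })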

filter list to first 2 case classes per parameter value in scala dataset

I have a spark dataset like this:
+--------+--------------------+
| uid| recommendations|
+--------+--------------------+
|41344966|[[2133, red]...|
|41345063|[[11353, red...|
|41346177|[[2996, yellow]...|
|41349171|[[8477, green]...|
res98: org.apache.spark.sql.Dataset[userItems] = [uid: int, recommendations: array<struct<iid:int,color:string>>]
I want to filter each recommendations array so that it contains only the first two of each color. Pseudo example:
[(13,'red'), (4,'green'), (8,'red'), (2,'red'), (10, 'yellow')]
would become
[(13,'red'), (4,'green'), (8,'red'), (10, 'yellow')]
How can I efficiently do this in Scala with Datasets? Is there an elegant solution using something like reduceGroups?
What I have so far:
case class itemData (iid: Int, color: String)
val filterList = (recs: Array[itemData], filterAttribute: String, maxCount: Int) => {
  // filter the list somehow... using the max count and attribute
}
dataset.map(d => filterList(d.recommendations, "color", 2))
You can explode the recommendations, then create a row number partitioned by uid and color, and finally filter out the row numbers greater than 2. The code should look like below. I hope it helps.
//Creating Test Data
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq((13,"red"), (4,"green"), (8,"red"), (2,"red"), (10, "yellow")).toDF("iid", "color")
  .withColumn("uid", lit(41344966))
  .groupBy("uid").agg(collect_list(struct("iid", "color")).as("recommendations"))
df.show(false)
+--------+----------------------------------------------------+
|uid |recommendations |
+--------+----------------------------------------------------+
|41344966|[[13,red], [4,green], [8,red], [2,red], [10,yellow]]|
+--------+----------------------------------------------------+
val filterDF = df.withColumn("rec", explode(col("recommendations")))
  .withColumn("iid", col("rec.iid"))
  .withColumn("color", col("rec.color"))
  .drop("recommendations", "rec")
  .withColumn("rownum",
    row_number().over(Window.partitionBy("uid", "color").orderBy(col("iid").desc)))
  .filter(col("rownum") <= 2)
  .groupBy("uid").agg(collect_list(struct("iid", "color")).as("recommendations"))
filterDF.show(false)
+--------+-------------------------------------------+
|uid |recommendations |
+--------+-------------------------------------------+
|41344966|[[4,green], [13,red], [8,red], [10,yellow]]|
+--------+-------------------------------------------+
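The Dataset route the question asks about is also workable without exploding. Here is a minimal sketch of mine, assuming a userItems case class matching the schema shown (uid: Int, recommendations: Seq[itemData]) together with the question's itemData case class:
import spark.implicits._

// Assumed element type of the Dataset, mirroring the printed schema.
case class userItems(uid: Int, recommendations: Seq[itemData])

// Keep only the first maxCount items of each color, preserving the original order.
def firstNPerColor(recs: Seq[itemData], maxCount: Int): Seq[itemData] = {
  val seen = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)
  recs.filter { item =>
    seen(item.color) += 1
    seen(item.color) <= maxCount
  }
}

val filtered = dataset.as[userItems]
  .map(u => u.copy(recommendations = firstNPerColor(u.recommendations, 2)))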

Calculating edit distance on successive rows of a Spark DataFrame

I have a data frame as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._
// some data...
val df = Seq(
(1, "AA", "BB", ("AA", "BB")),
(2, "AA", "BB", ("AA", "BB")),
(3, "AB", "BB", ("AB", "BB"))
).toDF("id","name", "surname", "array")
df.show()
and I am looking to calculate the edit distance between the 'array' column in successive rows. As an example, I want to calculate the edit distance between the 'array' entry in row 1 ("AA", "BB") and the 'array' entry in row 2 ("AA", "BB"). Here is the edit distance function I am using:
def editDist2[A](a: Iterable[A], b: Iterable[A]): Int = {
  val startRow = (0 to b.size).toList
  a.foldLeft(startRow) { (prevRow, aElem) =>
    (prevRow.zip(prevRow.tail).zip(b)).scanLeft(prevRow.head + 1) {
      case (left, ((diag, up), bElem)) => {
        val aGapScore = up + 1
        val bGapScore = left + 1
        val matchScore = diag + (if (aElem == bElem) 0 else 1)
        List(aGapScore, bGapScore, matchScore).min
      }
    }
  }.last
}
I know I need to create a UDF for this function but can't seem to manage it. If I use the function as is, using Spark windowing to get at the previous row:
// creating window - ordered by ID
val window = Window.orderBy("id")
// using the window with lag function to compare to previous value in each column
df.withColumn("edit-d", editDist2(($"array"), lag("array", 1).over(window))).show()
I get the following error:
<console>:245: error: type mismatch;
found : org.apache.spark.sql.ColumnName
required: Iterable[?]
df.withColumn("edit-d", editDist2(($"array"), lag("array", 1).over(window))).show()
I figured out you can use Spark's own levenshtein function for this. This function takes in two strings to compare, so it can't be used with the array.
// creating window - ordered by ID
val window = Window.orderBy("id")
// using the window with lag function to compare to previous value in each column
df.withColumn("edit-d", levenshtein(($"name"), lag("name", 1).over(window)) + levenshtein(($"surname"), lag("surname", 1).over(window))).show()
giving the desired output:
+---+----+-------+--------+------+
| id|name|surname| array|edit-d|
+---+----+-------+--------+------+
| 1| AA| BB|[AA, BB]| null|
| 2| AA| BB|[AA, BB]| 0|
| 3| AB| BB|[AB, BB]| 1|
+---+----+-------+--------+------+
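For reference, the UDF route the question mentions can also work. Below is an untested sketch of mine that packs the two string columns into an array<string> column first (a Scala UDF receives a struct as a Row, so an array column is easier to handle) and reuses the question's editDist2:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{array, lag, udf}

// Wrap editDist2 as a UDF over two Seq[String] inputs; return None for the first row,
// where lag() yields null.
val editDistUdf = udf { (a: Seq[String], b: Seq[String]) =>
  if (a == null || b == null) None else Some(editDist2(a, b))
}

val window = Window.orderBy("id")
df.withColumn("arr", array($"name", $"surname"))
  .withColumn("edit-d", editDistUdf($"arr", lag($"arr", 1).over(window)))
  .show()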

SumProduct in Spark DataFrame

I want to create essentially a sumproduct across columns in a Spark DataFrame. I have a DataFrame that looks like this:
id val1 val2 val3 val4
123 10 5 7 5
I also have a Map that looks like:
val coefficients = Map("val1" -> 1, "val2" -> 2, "val3" -> 3, "val4" -> 4)
I want to take the value in each column of the DataFrame, multiply it by the corresponding value from the map, and return the result in a new column so essentially:
(10*1) + (5*2) + (7*3) + (5*4) = 61
I tried this:
val myDF1 = myDF.withColumn("mySum", {var a:Double = 0.0; for ((k,v) <- coefficients) a + (col(k).cast(DoubleType)*coefficients(k));a})
but got an error that the "+" method was overloaded. Even if I solved that, I'm not sure this would work. Any ideas? I could always dynamically build a SQL query as a text string and do it that way, but I was hoping for something a little more elegant.
Any ideas are appreciated.
The problem with your code is that you try to add a Column to a Double. cast(DoubleType) affects only the type of the stored value, not the type of the column itself. Since Double doesn't provide a *(x: org.apache.spark.sql.Column): org.apache.spark.sql.Column method, everything fails.
To make it work you can for example do something like this:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit}

val df = sc.parallelize(Seq(
  (123, 10, 5, 7, 5), (456, 1, 1, 1, 1)
)).toDF("k", "val1", "val2", "val3", "val4")

val coefficients = Map("val1" -> 1, "val2" -> 2, "val3" -> 3, "val4" -> 4)

val dotProduct: Column = coefficients
  // To be explicit you can replace
  // col(k) * v with col(k) * lit(v)
  // but it is not required here
  // since we use the Column.* method, not Int.*
  .map{ case (k, v) => col(k) * v } // * -> Column.*
  .reduce(_ + _)                    // + -> Column.+

df.withColumn("mySum", dotProduct).show
// +---+----+----+----+----+-----+
// | k|val1|val2|val3|val4|mySum|
// +---+----+----+----+----+-----+
// |123| 10| 5| 7| 5| 61|
// |456| 1| 1| 1| 1| 10|
// +---+----+----+----+----+-----+
It looks like the issue is that you aren't actually doing anything with a in
for ((k, v) <- coefficients) a + ...
You probably meant a += ...
Also, some advice for cleaning up the block of code inside the withColumn call:
You don't need to call coefficients(k) because you've already got its value in v from for((k,v) <- coefficients)
Scala is pretty good at making one-liners, but it's kinda cheating if you have to put semicolons in that one line :P I'd suggest breaking up the sum calculation section into one line per expression.
The sum expression could be rewritten as a fold which avoids using a var (idiomatic Scala usually avoids vars), e.g.
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.DoubleType

coefficients.foldLeft(lit(0.0)) {
  case (sumSoFar, (k, v)) => col(k).cast(DoubleType) * v + sumSoFar
}
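Dropped into withColumn on the question's myDF, the fold would be used like this (a usage sketch, assuming the coefficients map from the question):
val mySum = coefficients.foldLeft(lit(0.0)) {
  case (sumSoFar, (k, v)) => col(k).cast(DoubleType) * v + sumSoFar
}
myDF.withColumn("mySum", mySum).show()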
I'm not sure if this is possible through the DataFrame API since you are only able to work with columns and not any predefined closures (e.g. your parameter map).
I've outlined a way below using the underlying RDD of the DataFrame:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
// Initializing your input example.
val df1 = sc.parallelize(Seq((123, 10, 5, 7, 5))).toDF("id", "val1", "val2", "val3", "val4")
// Return column names as an array
val names = df1.columns
// Grab underlying RDD and zip elements with column names
val rdd1 = df1.rdd.map(row => (0 until row.length).map(row.getInt(_)).zip(names))
// Tack on accumulated total to the existing row
val rdd2 = rdd1.map { seq => Row.fromSeq(seq.map(_._1) :+ seq.map { case (value: Int, name: String) => value * coefficients.getOrElse(name, 0) }.sum) }
// Create output schema (with total)
val totalSchema = StructType(df1.schema.fields :+ StructField("total", IntegerType))
// Apply schema to create output dataframe
val df2 = sqlContext.createDataFrame(rdd2, totalSchema)
// Show output:
df2.show()
...
+---+----+----+----+----+-----+
| id|val1|val2|val3|val4|total|
+---+----+----+----+----+-----+
|123| 10| 5| 7| 5| 61|
+---+----+----+----+----+-----+