SumProduct in Spark DataFrame - scala

I want to create essentially a sumproduct across columns in a Spark DataFrame. I have a DataFrame that looks like this:
id val1 val2 val3 val4
123 10 5 7 5
I also have a Map that looks like:
val coefficients = Map("val1" -> 1, "val2" -> 2, "val3" -> 3, "val4" -> 4)
I want to take the value in each column of the DataFrame, multiply it by the corresponding value from the map, and return the result in a new column so essentially:
(10*1) + (5*2) + (7*3) + (5*4) = 61
I tried this:
val myDF1 = myDF.withColumn("mySum", {var a:Double = 0.0; for ((k,v) <- coefficients) a + (col(k).cast(DoubleType)*coefficients(k));a})
but got an error that the "+" method was overloaded. Even if I solved that, I'm not sure this would work. I could always build a SQL query dynamically as a text string and do it that way, but I was hoping for something a little more elegant.
Any ideas are appreciated.

The problem with your code is that you try to add a Column to a Double. cast(DoubleType) changes only the type of the stored values, not the type of the column object itself. Since Double doesn't provide a +(x: org.apache.spark.sql.Column): org.apache.spark.sql.Column method, the expression fails to compile.
To make it work you can for example do something like this:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit}
val df = sc.parallelize(Seq(
(123, 10, 5, 7, 5), (456, 1, 1, 1, 1)
)).toDF("k", "val1", "val2", "val3", "val4")
val coefficients = Map("val1" -> 1, "val2" -> 2, "val3" -> 3, "val4" -> 4)
val dotProduct: Column = coefficients
// To be explicit you can replace
// col(k) * v with col(k) * lit(v),
// but it is not required here
// since this * is the Column.* method, not Int.*
.map{ case (k, v) => col(k) * v } // * -> Column.*
.reduce(_ + _) // + -> Column.+
df.withColumn("mySum", dotProduct).show
// +---+----+----+----+----+-----+
// | k|val1|val2|val3|val4|mySum|
// +---+----+----+----+----+-----+
// |123| 10| 5| 7| 5| 61|
// |456| 1| 1| 1| 1| 10|
// +---+----+----+----+----+-----+
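To see why the operand order matters, here is a toy model in plain Scala that needs no Spark. Expr is a hypothetical stand-in for Column that only records the expression as text; it is not Spark's API, just a sketch of how Scala looks up an operator method on the left-hand operand.

```scala
// Hypothetical stand-in for a Spark Column: it only records the expression as text.
final case class Expr(sql: String) {
  def *(n: Int): Expr = Expr(s"($sql * $n)")                 // plays the role of Column.*
  def +(other: Expr): Expr = Expr(s"($sql + ${other.sql})")  // plays the role of Column.+
}

// A Seq of pairs keeps the terms in a fixed order for the demonstration.
val toyCoefficients = Seq("val1" -> 1, "val2" -> 2, "val3" -> 3)

// Same shape as the answer: map each pair to a term, then reduce with +.
val toyDotProduct: Expr =
  toyCoefficients.map { case (k, v) => Expr(k) * v }.reduce(_ + _)

// toyDotProduct.sql is "(((val1 * 1) + (val2 * 2)) + (val3 * 3))"

// By contrast, `0.0 + Expr("val1")` does not compile: the method is looked up
// on the left operand, and Double has no + that accepts an Expr.
```

Note that reduce throws on an empty collection, so a fold with an explicit zero is the safer choice if the coefficient map can be empty.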

It looks like the issue is that you aren't actually doing anything with a
for((k, v) <- coefficients) a + ...
You probably meant a += ...
Also, some advice for cleaning up the block of code inside the withColumn call:
You don't need to call coefficients(k) because you've already got its value in v from for((k,v) <- coefficients)
Scala is pretty good at making one-liners, but it's kinda cheating if you have to put semicolons in that one line :P I'd suggest breaking up the sum calculation section into one line per expression.
The sum expression could be rewritten as a fold which avoids using a var (idiomatic Scala usually avoids vars), e.g.
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.DoubleType
coefficients.foldLeft(lit(0.0)){
case (sumSoFar, (k,v)) => col(k).cast(DoubleType) * v + sumSoFar
}
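The difference between a + ..., a += ... and a fold can be checked on plain Scala values, without Spark. This is only a sketch: sampleRow is a hypothetical stand-in for one DataFrame row, and plain Doubles replace Columns.

```scala
val coeffs = Map("val1" -> 1, "val2" -> 2, "val3" -> 3, "val4" -> 4)
// Hypothetical stand-in for a single row of the DataFrame.
val sampleRow = Map("val1" -> 10.0, "val2" -> 5.0, "val3" -> 7.0, "val4" -> 5.0)

// `a + ...` computes a value and throws it away, so `a` never changes.
val discarded: Double = {
  var a = 0.0
  for ((k, v) <- coeffs) a + sampleRow(k) * v
  a
}

// `a += ...` actually accumulates.
val accumulated: Double = {
  var a = 0.0
  for ((k, v) <- coeffs) a += sampleRow(k) * v
  a
}

// The fold gives the same result without a var.
val folded: Double =
  coeffs.foldLeft(0.0) { case (sumSoFar, (k, v)) => sumSoFar + sampleRow(k) * v }

// discarded == 0.0, accumulated == 61.0, folded == 61.0
```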

I'm not sure if this is possible through the DataFrame API since you are only able to work with columns and not any predefined closures (e.g. your parameter map).
I've outlined a way below using the underlying RDD of the DataFrame:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
// Initializing your input example.
val df1 = sc.parallelize(Seq((123, 10, 5, 7, 5))).toDF("id", "val1", "val2", "val3", "val4")
// Return column names as an array
val names = df1.columns
// Grab underlying RDD and zip elements with column names
val rdd1 = df1.rdd.map(row => (0 until row.length).map(row.getInt(_)).zip(names))
// Tack on accumulated total to the existing row
val rdd2 = rdd1.map { seq =>
  Row.fromSeq(seq.map(_._1) :+ seq.map { case (value: Int, name: String) => value * coefficients.getOrElse(name, 0) }.sum)
}
// Create output schema (with total)
val totalSchema = StructType(df1.schema.fields :+ StructField("total", IntegerType))
// Apply schema to create output dataframe
val df2 = sqlContext.createDataFrame(rdd2, totalSchema)
// Show output:
df2.show()
...
+---+----+----+----+----+-----+
| id|val1|val2|val3|val4|total|
+---+----+----+----+----+-----+
|123| 10| 5| 7| 5| 61|
+---+----+----+----+----+-----+
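The per-row step of the RDD approach can be tried on plain collections. The sketch below uses hypothetical names (rowValues, columnNames, weights) and shows why the id column does not disturb the total: it has no entry in the map, so getOrElse gives it a coefficient of 0.

```scala
val weights = Map("val1" -> 1, "val2" -> 2, "val3" -> 3, "val4" -> 4)

// Hypothetical stand-ins for df1.columns and for the values of one row.
val columnNames = Array("id", "val1", "val2", "val3", "val4")
val rowValues = Seq(123, 10, 5, 7, 5)

// Pair each value with its column name, weight it, and sum.
// "id" is missing from the map, so it is multiplied by 0.
val rowTotal: Int = rowValues
  .zip(columnNames)
  .map { case (value, name) => value * weights.getOrElse(name, 0) }
  .sum

// rowTotal == 61
```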

Related

Spark Scala - drop the first element from the array in dataframe

I have a following dataframe
+--------------------+
| values |
+--------------------+
|[[1,1,1],[3,2,4],[1,|
|[[1,1,2],[2,2,4],[1,|
|[[1,1,3],[4,2,4],[1,|
I want a column with the tail of the list. So far I know how to select the first element
val df1 = df.select($"values".getItem(0)), but is there a method that would let me drop the first element?
A UDF with a simple size check seems to be the simplest solution:
import org.apache.spark.sql.functions.udf
val df = Seq((1, Seq(1, 2, 3)), (2, Seq(4, 5))).toDF("c1", "c2")
def tail = udf( (s: Seq[Int]) => if (s.size > 1) s.tail else Seq.empty[Int] )
df.select($"c1", tail($"c2").as("c2tail")).show
// +---+------+
// | c1|c2tail|
// +---+------+
// | 1|[2, 3]|
// | 2| [5]|
// +---+------+
As suggested in the comments, a preferred solution is to use Spark's built-in slice function (available since Spark 2.4):
df.select($"c1", slice($"c2", 2, Int.MaxValue).as("c2tail"))
I don't think a built-in operator exists for this.
But you can use UDFs, for example:
import collection.mutable.WrappedArray
def tailUdf = udf((array: WrappedArray[WrappedArray[Int]])=> array.tail)
df.select(tailUdf(col("values"))).show()
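One thing to watch with the UDF bodies above is that .tail throws on an empty collection. A minimal plain-Scala sketch of the difference follows; safeTail is a hypothetical helper name, and null arrays (which a Spark UDF may also receive) are not handled here.

```scala
import scala.util.Try

// drop(1) returns an empty collection for empty or single-element input,
// so no explicit size check is needed.
def safeTail[A](xs: Seq[A]): Seq[A] = xs.drop(1)

val nested = Seq(Seq(1, 1, 1), Seq(3, 2, 4), Seq(1, 5, 2))
val nestedTail = safeTail(nested)         // Seq(Seq(3, 2, 4), Seq(1, 5, 2))
val emptyTail = safeTail(Seq.empty[Int])  // empty, no exception

// .tail on an empty Seq throws instead.
val tailOnEmptyFails = Try(Seq.empty[Int].tail).isFailure  // true
```

The same body, xs.drop(1), could be used inside either UDF in place of the size check or the bare .tail call.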

How to extract efficiently multiple columns from a single string column RDD?

I have a file with 20+ columns of which I would like to extract a few. Until now, I have the following code. I'm sure there is a smart way to do it, but not able to get it working successfully. Any ideas?
mvnmdata is of type RDD[String]
val strpcols = mvnmdata.map(x => x.split('|')).map(x => (x(0),x(1),x(5),x(6),x(7),x(8),x(9),x(10),x(11),x(12),x(13),x(14),x(15),x(16),x(17),x(18),x(19),x(20),x(21),x(22),x(23) ))
The following solution provides an easy and scalable way to manage your column names and indices. It is based on a map that defines the column name/index relation, so one structure handles both the index of each extracted column and its name.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructType, StructField}
val rdd = spark.sparkContext.parallelize(Seq(
"1|500|400|300",
"1|34|67|89",
"2|10|20|56",
"3|2|5|56",
"3|1|8|22"))
val dictColums = Map("c0" -> 0, "c2" -> 2)
// create schema from map keys
val schema = StructType(dictColums.keys.toSeq.map(StructField(_, StringType, true)))
val mappedRDD = rdd.map{line => line.split('|')}
.map{
cols => Row.fromSeq(dictColums.values.toSeq.map{cols(_)})
}
val df = spark.createDataFrame(mappedRDD, schema)
df.show
//output
+---+---+
| c0| c2|
+---+---+
| 1|400|
| 1| 67|
| 2| 20|
| 3| 5|
| 3| 8|
+---+---+
First, we declare dictColums; in this example we extract the columns "c0" -> 0 and "c2" -> 2.
Next, we create the schema from the keys of the map.
The first map (which you already have) splits each line by |; the second creates a Row containing the values at the indices listed in dictColums.values.
UPDATE:
You could also create a function from the above functionality in order to be able to reuse it multiple times:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
def stringRddToDataFrame(colsMapping: Map[String, Int], rdd: RDD[String]) : DataFrame = {
val schema = StructType(colsMapping.keys.toSeq.map(StructField(_, StringType, true)))
val mappedRDD = rdd.map{line => line.split('|')}
.map{
cols => Row.fromSeq(colsMapping.values.toSeq.map{cols(_)})
}
spark.createDataFrame(mappedRDD, schema)
}
And then use it for your case:
val cols = Map("c0" -> 0, "c1" -> 1, "c5" -> 5, ... "c23" -> 23)
val df = stringRddToDataFrame(cols, rdd)
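In the function above, the schema and the values stay consistent with each other, but a plain Map does not promise to iterate in the order the entries were written, so with many columns the output column order may not be the one you expect. A minimal sketch of the same per-line extraction driven by an ordered Seq of pairs is below; columnSpec and extractFields are hypothetical names, and the code runs on a single line of text rather than on an RDD.

```scala
// Hypothetical column spec as an ordered Seq of (name, index) pairs.
val columnSpec: Seq[(String, Int)] = Seq("c0" -> 0, "c2" -> 2, "c1" -> 1)

// Column names in the declared order; these would feed the StructType.
val columnNames: Seq[String] = columnSpec.map(_._1)

// Pick the requested fields from one pipe-delimited line, in the declared order.
def extractFields(line: String, spec: Seq[(String, Int)]): Seq[String] = {
  val cols = line.split('|')
  spec.map { case (_, index) => cols(index) }
}

val picked = extractFields("1|500|400|300", columnSpec)  // Seq("1", "400", "500")
```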
If you don't want to write x(i) repeatedly, you can process the indices in a loop, as shown below. Example 1:
import scala.collection.mutable.ArrayBuffer
val strpcols = mvnmdata.map(x => x.split('|'))
.map(x =>{
val xbuffer = new ArrayBuffer[String]()
for (i <- Array(0,1,5,6...)){
xbuffer.append(x(i))
}
xbuffer
})
If you only want to define the index list by its start, its end, and the numbers to exclude, see Example 2 below:
scala> (1 to 10).toSet
res8: scala.collection.immutable.Set[Int] = Set(5, 10, 1, 6, 9, 2, 7, 3, 8, 4)
scala> ((1 to 10).toSet -- Set(2,9)).toArray.sortBy(row=>row)
res9: Array[Int] = Array(1, 3, 4, 5, 6, 7, 8, 10)
The final code you want:
//define the function to process indexes
def getSpecIndexes(start:Int, end:Int, removedValueSet:Set[Int]):Array[Int] = {
((start to end).toSet -- removedValueSet).toArray.sortBy(row=>row)
}
val strpcols = mvnmdata.map(x => x.split('|'))
.map(x =>{
val xbuffer = new ArrayBuffer[String]()
//call the function
for (i <- getSpecIndexes(0,100,Set(3,4,5,6))){
xbuffer.append(x(i))
}
xbuffer
})
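As a possible simplification of the index helper, a range can be filtered directly, which keeps it in ascending order without going through a Set and a sort. This is a sketch with hypothetical names (specIndexes, pickFields); note that it still assumes every requested index exists in the line, just like the code above.

```scala
// Indices from start to end (inclusive), minus the excluded ones, in ascending order.
// A Set[Int] can be used directly as the predicate for filterNot.
def specIndexes(start: Int, end: Int, removed: Set[Int]): Seq[Int] =
  (start to end).filterNot(removed)

// Select the fields at the given indices from one pipe-delimited line.
def pickFields(line: String, indexes: Seq[Int]): Seq[String] = {
  val cols = line.split('|')
  indexes.map(i => cols(i))
}

val indexes = specIndexes(1, 10, Set(2, 9))
// Seq(1, 3, 4, 5, 6, 7, 8, 10)

val fields = pickFields("a|b|c|d|e", specIndexes(0, 4, Set(1, 3)))
// Seq("a", "c", "e")
```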

Calculating edit distance on successive rows of a Spark DataFrame

I have a data frame as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._
// some data...
val df = Seq(
(1, "AA", "BB", ("AA", "BB")),
(2, "AA", "BB", ("AA", "BB")),
(3, "AB", "BB", ("AB", "BB"))
).toDF("id","name", "surname", "array")
df.show()
and I am looking to calculate the edit distance between the 'array' column in successive rows. As an example, I want to calculate the edit distance between the 'array' entity in row 1 ("AA", "BB") and the 'array' entity in row 2 ("AA", "BB"). Here is the edit distance function I am using:
def editDist2[A](a: Iterable[A], b: Iterable[A]): Int = {
val startRow = (0 to b.size).toList
a.foldLeft(startRow) { (prevRow, aElem) =>
(prevRow.zip(prevRow.tail).zip(b)).scanLeft(prevRow.head + 1) {
case (left, ((diag, up), bElem)) => {
val aGapScore = up + 1
val bGapScore = left + 1
val matchScore = diag + (if (aElem == bElem) 0 else 1)
List(aGapScore, bGapScore, matchScore).min
}
}
}.last
}
I know I need to create a UDF for this function but can't seem to manage it. If I use the function as is, with Spark windowing to get at the previous row:
// creating window - ordered by ID
val window = Window.orderBy("id")
// using the window with lag function to compare to previous value in each column
df.withColumn("edit-d", editDist2(($"array"), lag("array", 1).over(window))).show()
i get the following error:
<console>:245: error: type mismatch;
found : org.apache.spark.sql.ColumnName
required: Iterable[?]
df.withColumn("edit-d", editDist2(($"array"), lag("array", 1).over(window))).show()
I figured out you can use Spark's own levenshtein function for this. It takes two strings to compare, so it can't be applied to the array column directly; instead I apply it to the name and surname columns and add the results.
// creating window - ordered by ID
val window = Window.orderBy("id")
// using the window with lag function to compare to previous value in each column
df.withColumn("edit-d", levenshtein(($"name"), lag("name", 1).over(window)) + levenshtein(($"surname"), lag("surname", 1).over(window))).show()
giving the desired output:
+---+----+-------+--------+------+
| id|name|surname| array|edit-d|
+---+----+-------+--------+------+
| 1| AA| BB|[AA, BB]| null|
| 2| AA| BB|[AA, BB]| 0|
| 3| AB| BB|[AB, BB]| 1|
+---+----+-------+--------+------+
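For comparison, the original editDist2 does work once it is given Scala collections rather than Column objects, which is what the type mismatch was complaining about. The sketch below runs it in plain Scala on the three 'array' values from the question, using sliding(2) as a local analogue of lag; the names arrays, pairDistances and distances are hypothetical, and wrapping this in a UDF is not shown.

```scala
// The edit distance function from the question, unchanged in logic.
def editDist2[A](a: Iterable[A], b: Iterable[A]): Int = {
  val startRow = (0 to b.size).toList
  a.foldLeft(startRow) { (prevRow, aElem) =>
    prevRow.zip(prevRow.tail).zip(b).scanLeft(prevRow.head + 1) {
      case (left, ((diag, up), bElem)) =>
        val aGapScore = up + 1
        val bGapScore = left + 1
        val matchScore = diag + (if (aElem == bElem) 0 else 1)
        List(aGapScore, bGapScore, matchScore).min
    }
  }.last
}

// The 'array' values of the three rows, ordered by id.
val arrays = Seq(Seq("AA", "BB"), Seq("AA", "BB"), Seq("AB", "BB"))

// Compare each row with the previous one.
val pairDistances: List[Int] =
  arrays.sliding(2).map { case Seq(prev, cur) => editDist2(cur, prev) }.toList

// The first row has no predecessor, so it gets None (null in the Spark output).
val distances: List[Option[Int]] = None :: pairDistances.map(d => Some(d))
// List(None, Some(0), Some(1))
```

Here the distance counts differing elements of the sequence, not differing characters, so it only coincides with the levenshtein-based result above for this particular data.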

Finding size of distinct array column

I am using Scala and Spark to create a dataframe. Here's my code so far:
val df = transformedFlattenDF
.groupBy($"market", $"city", $"carrier").agg(count("*").alias("count"), min($"bandwidth").alias("bandwidth"), first($"network").alias("network"), concat_ws(",", collect_list($"carrierCode")).alias("carrierCode")).withColumn("carrierCode", split(($"carrierCode"), ",").cast("array<string>")).withColumn("Carrier Count", collect_set("carrierCode"))
The column carrierCode becomes an array column. The data is present as follows:
CarrierCode
1: [12,2,12]
2: [5,2,8]
3: [1,1,3]
I'd like to create a column that counts the number of distinct values in each array. I tried using collect_set, but it gives me an error saying "grouping expressions sequence is empty". Is it possible to find the number of distinct values in each row's array? For the same example, the result would be a column like this:
Carrier Count
1: 2
2: 3
3: 2
collect_set is for aggregation hence should be applied within your groupBy-agg step:
val df = transformedFlattenDF.groupBy($"market", $"city", $"carrier").agg(
count("*").alias("count"), min($"bandwidth").alias("bandwidth"),
first($"network").alias("network"),
concat_ws(",", collect_list($"carrierCode")).alias("carrierCode"),
size(collect_set($"carrierCode")).as("carrier_count") // <-- ADDED `collect_set`
).
withColumn("carrierCode", split(($"carrierCode"), ",").cast("array<string>"))
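What the aggregation does per group can be pictured on plain Scala collections. The sketch below uses made-up (market, carrierCode) pairs and hypothetical names; the grouped list plays the role of collect_list, and the size of its Set plays the role of size(collect_set(...)).

```scala
// Made-up sample rows: (market, carrierCode).
val carrierRows = Seq(
  ("NY", "12"), ("NY", "2"), ("NY", "12"),
  ("LA", "5"), ("LA", "2"), ("LA", "8")
)

// All codes per group, duplicates kept (the collect_list analogue).
val codesPerMarket: Map[String, Seq[String]] =
  carrierRows.groupBy(_._1).map { case (market, group) => market -> group.map(_._2) }

// Number of distinct codes per group (the size(collect_set(...)) analogue).
val distinctPerMarket: Map[String, Int] =
  codesPerMarket.map { case (market, codes) => market -> codes.toSet.size }

// distinctPerMarket == Map("NY" -> 2, "LA" -> 3)
```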
If you don't want to change the existing groupBy-agg code, you can create a UDF like in the following example:
import org.apache.spark.sql.functions._
val codeDF = Seq(
Array("12", "2", "12"),
Array("5", "2", "8"),
Array("1", "1", "3")
).toDF("carrier_code")
def distinctElemCount = udf( (a: Seq[String]) => a.toSet.size )
codeDF.withColumn("carrier_count", distinctElemCount($"carrier_code")).
show
// +------------+-------------+
// |carrier_code|carrier_count|
// +------------+-------------+
// | [12, 2, 12]| 2|
// | [5, 2, 8]| 3|
// | [1, 1, 3]| 2|
// +------------+-------------+
For posterity, here is a version without a UDF that converts to an RDD (and could be converted back to a DataFrame):
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(
("A", 2, 100, 2), ("F", 7, 100, 1), ("B", 10, 100, 100)
)).toDF("c1", "c2", "c3", "c4")
val x = df.select("c1", "c2", "c3", "c4").rdd.map(x => (x.get(0), List(x.get(1), x.get(2), x.get(3))) )
val y = x.map {case (k, vL) => (k, vL.toSet.size) }
// Manipulate back to your DF, via conversion, join, what not.
Returns:
res15: Array[(Any, Int)] = Array((A,2), (F,3), (B,2))
The solution above is better; this one is included mainly for posterity.
You can also use a UDF, like this:
//Input
df.show
+-----------+
|CarrierCode|
+-----------+
|1:[12,2,12]|
| 2:[5,2,8]|
| 3:[1,1,3]|
+-----------+
//udf
val countUDF = udf { (str: String) =>
  val strArr = str.split(":")
  // strip the surrounding brackets so that "[12" and "12]" are not counted as separate values
  val codes = strArr(1).stripPrefix("[").stripSuffix("]").split(",")
  strArr(0) + ":" + codes.distinct.length
}
df.withColumn("Carrier Count",countUDF(col("CarrierCode"))).show
//Sample Output:
+-----------+-------------+
|CarrierCode|Carrier Count|
+-----------+-------------+
|1:[12,2,12]| 1:2|
| 2:[5,2,8]| 2:3|
| 3:[1,1,3]| 3:2|
+-----------+-------------+

Spark withColumn - add column using non-Column type variable [duplicate]

How can I add a column to a data frame from a variable value?
I know that I can create a data frame using .toDF(colName) and that .withColumn is the method to add the column. But, when I try the following, I get a type mismatch error:
val myList = List(1,2,3)
val myArray = Array(1,2,3)
myList.toDF("myList")
.withColumn("myArray", myArray)
Type mismatch, expected: Column, actual: Array[Int]
This compile error is on myArray within the .withColumn call. How can I convert it from an Array[Int] to a Column type?
The error message tells you exactly what is wrong: the second argument to withColumn() must be a Column, so wrap the value with lit() or typedLit().
Try this:
import org.apache.spark.sql.functions.typedLit
val myList = List(1,2,3)
val myArray = Array(1,2,3)
myList.toDF("myList")
.withColumn("myArray", typedLit(myArray))
:)
Not sure withColumn is what you're actually seeking. You could apply lit() to make myArray conform to the method specs, but the result will be the same array value for every row in the DataFrame:
myList.toDF("myList").withColumn("myArray", lit(myArray)).
show
// +------+---------+
// |myList| myArray|
// +------+---------+
// | 1|[1, 2, 3]|
// | 2|[1, 2, 3]|
// | 3|[1, 2, 3]|
// +------+---------+
If you're trying to merge the two collections column-wise, it's a different transformation from what withColumn offers. In that case you'll need to convert each of them into a DataFrame and combine them via a join.
Now if the elements of the two collections are row-identifying and match each other pair-wise like in your example and you want to join them that way, you can simply join the converted DataFrames:
myList.toDF("myList").join(
myArray.toSeq.toDF("myArray"), $"myList" === $"myArray"
).show
// +------+-------+
// |myList|myArray|
// +------+-------+
// | 1| 1|
// | 2| 2|
// | 3| 3|
// +------+-------+
But if the two collections have elements that aren't joinable and you simply want to merge them column-wise, you'll need compatible row-identifying columns in the two DataFrames to join on. If there are no such row-identifying columns, one approach is to create your own rowIds, as in the following example:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val df1 = List("a", "b", "c").toDF("myList")
val df2 = Array("x", "y", "z").toSeq.toDF("myArray")
val rdd1 = df1.rdd.zipWithIndex.map{
case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
}
val df1withId = spark.createDataFrame( rdd1,
StructType(df1.schema.fields :+ StructField("rowId", LongType, false))
)
val rdd2 = df2.rdd.zipWithIndex.map{
case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
}
val df2withId = spark.createDataFrame( rdd2,
StructType(df2.schema.fields :+ StructField("rowId", LongType, false))
)
df1withId.join(df2withId, Seq("rowId")).show
// +-----+------+-------+
// |rowId|myList|myArray|
// +-----+------+-------+
// | 0| a| x|
// | 1| b| y|
// | 2| c| z|
// +-----+------+-------+
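The idea behind the rowId join can be sketched on plain Scala collections: give every element its position as a key, then match the two sides on that key. The names below are hypothetical; also note that zipWithIndex on a local collection yields Int positions, whereas the RDD version yields Long.

```scala
val left = List("a", "b", "c")
val right = Array("x", "y", "z")

// Key every element by its position.
val leftById: List[(Int, String)] = left.zipWithIndex.map { case (value, id) => id -> value }
val rightById: Map[Int, String] = right.zipWithIndex.map { case (value, id) => id -> value }.toMap

// Inner join on the position: rows without a partner on the other side are dropped.
val joined: List[(Int, String, String)] =
  leftById.flatMap { case (id, l) => rightById.get(id).map(r => (id, l, r)) }

// joined == List((0, "a", "x"), (1, "b", "y"), (2, "c", "z"))
```

As with the DataFrame join above, this is an inner join, so if one collection is longer than the other, the extra elements simply do not appear in the result.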