How to retain the column structure of a Spark DataFrame following a map operation on rows

I am trying to apply a function to each row of a Spark DataFrame, as in the example.
val df = sc.parallelize(
Seq((1, 2, 0), (0, 0, 1), (0, 0, 0))).toDF("x", "y", "z")
df.show()
which yields
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 0|
| 0| 0| 1|
| 0| 0| 0|
+---+---+---+
Suppose I want to do something to the values in each row, for example changing 0 to 5.
val b = df.map(row => row.toSeq.map(x => x match{
case 0 => 5
case x: Int => x
}))
b.show()
+---------+
| value|
+---------+
|[1, 2, 5]|
|[5, 5, 1]|
|[5, 5, 5]|
+---------+
The function worked, but I now have one column whose entries are Lists, instead of 3 columns of Ints. I would like my named columns back.

You can define a UDF to apply this substitution. For example:
import org.apache.spark.sql.functions.{col, udf}
def subsDef(k: Int): Int = if (k == 0) 5 else k
val subs = udf[Int, Int](subsDef)
Then you can apply the UDF to a specific column or, if you wish, to every column of the DataFrame:
// to a single column, for example "x"
// (df is a val in the question, so bind the result to a new name)
val dfX = df.withColumn("x", subs(col("x")))
dfX.show()
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 0|
| 5| 0| 1|
| 5| 0| 0|
+---+---+---+
// to every column of the DataFrame, folding over the column names instead of reassigning df
val dfAll = df.columns.foldLeft(df)((acc, c) => acc.withColumn(c, subs(col(c))))
dfAll.show()
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 5|
| 5| 5| 1|
| 5| 5| 5|
+---+---+---+

Rather than transforming the DataFrame row-wise, consider using the built-in Spark API functions when/otherwise, as follows:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq((1, 2, 0), (0, 0, 1), (0, 0, 0)).toDF("x", "y", "z")
val vFrom = 0
val vTo = 5
val cols = df.columns // Filter for specific columns if necessary
df.select( cols.map( c =>
when(col(c) === vFrom, vTo).otherwise(col(c)).as(c)
): _*
).show
// +---+---+---+
// | x| y| z|
// +---+---+---+
// | 1| 2| 5|
// | 5| 5| 1|
// | 5| 5| 5|
// +---+---+---+
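The "filter for specific columns" comment above can be made concrete. Here is a minimal sketch, assuming a local SparkSession and a hypothetical DataFrame named mixed that also carries a String column; it picks the Int columns from the schema and leaves everything else untouched:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.types.IntegerType

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("replace-sketch").getOrCreate()
import spark.implicits._

// Hypothetical input: two Int columns and one String column.
val mixed = Seq((1, 0, "a"), (0, 2, "b")).toDF("x", "y", "label")

// Names of the Int columns only, read from the schema.
val intCols = mixed.schema.fields.filter(_.dataType == IntegerType).map(_.name).toSet

// Replace 0 with 5 in Int columns; pass every other column through unchanged.
val replaced = mixed.select(mixed.columns.map { c =>
  if (intCols(c)) when(col(c) === 0, 5).otherwise(col(c)).as(c) else col(c)
}: _*)

replaced.show()
```

Because the select walks mixed.columns in order, the column names and their order are preserved.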

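There are various ways to do it. Here are some:
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.functions.{col, udf, when}
// 1. Map over each Row and rebuild it, passing an encoder for the original schema
df.map(row => {
  val size = row.size
  var seq: Seq[Int] = Seq.empty[Int]
  for (a <- 0 until size) {
    val value: Int = row(a).asInstanceOf[Int]
    val newVal: Int = value match {
      case 0 => 5
      case _ => value
    }
    seq = seq :+ newVal
  }
  Row.fromSeq(seq)
})(RowEncoder.apply(df.schema))
// 2. Use the built-in when/otherwise column functions
val columns = df.columns
df.select(
  columns.map(c => when(col(c) === 0, 5).otherwise(col(c)).as(c)): _*)
  .show()
// 3. Wrap the substitution in a UDF and apply it column by column
def fun: (Int => Int) = { x =>
  if (x == 0) 5 else x
}
val function = udf(fun)
df.select(function(col("x")).as("x"),
  function(col("y")).as("y"),
  function(col("z")).as("z"))
  .show()
// 4. Pattern match on the Row, again passing an encoder for the original schema
def checkZero(a: Int): Int = if (a == 0) 5 else a
df.map {
  case Row(a: Int, b: Int, c: Int) =>
    Row(checkZero(a), checkZero(b), checkZero(c))
} { RowEncoder.apply(df.schema) }
  .show()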

Related

Adding a count column to my sequence in Scala

Given the code below, how would I go about adding a count column (e.g. .count("*").as("count"))?
The final output should look something like this:
+---+------+------+-----------------------------+-----+
| id|sum(d)|max(b)|concat_ws(,, collect_list(s))|count|
+---+------+------+-----------------------------+-----+
|  1|   1.0|  true|                            a|    1|
|  2|   4.0|  true|                          b,b|    2|
|  3|   3.0|  true|                            c|    1|
+---+------+------+-----------------------------+-----+
Current code is below:
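import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val df = Seq(
  (1, 1.0, true, "a"),
  (2, 2.0, false, "b"),
  (3, 3.0, false, "b"),
  (2, 2.0, false, "c")
).toDF("id","d","b","s")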
val dataTypes: Map[String, DataType] = df.schema.map(sf => (sf.name,sf.dataType)).toMap
def genericAgg(c:String) = {
dataTypes(c) match {
case DoubleType => sum(col(c))
case StringType => concat_ws(",",collect_list(col(c))) // "append"
case BooleanType => max(col(c))
}
}
val aggExprs: Seq[Column] = df.columns.filterNot(_=="id")
.map(c => genericAgg(c))
df
.groupBy("id")
.agg(aggExprs.head,aggExprs.tail:_*)
.show()
You can simply append count("*").as("count") to aggExprs.tail in your agg, as shown below:
df.
groupBy("id").agg(aggExprs.head, aggExprs.tail :+ count("*").as("count"): _*).
show
// +---+------+------+-----------------------------+-----+
// | id|sum(d)|max(b)|concat_ws(,, collect_list(s))|count|
// +---+------+------+-----------------------------+-----+
// | 1| 1.0| true| a| 1|
// | 3| 3.0| false| b| 1|
// | 2| 4.0| false| b,c| 2|
// +---+------+------+-----------------------------+-----+
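Generated headers such as concat_ws(,, collect_list(s)) are awkward to reference later. As a rough sketch (assuming a local SparkSession, and spelling the aggregates out by hand rather than going through genericAgg), each expression can be aliased back to a short name next to the count:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, concat_ws, count, lit, max, sum}

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("agg-alias-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, 1.0, true, "a"),
  (2, 2.0, false, "b"),
  (3, 3.0, false, "b"),
  (2, 2.0, false, "c")
).toDF("id", "d", "b", "s")

// Each aggregate is aliased to its source column name; count gets its own name.
val aggExprs = Seq(
  sum(col("d")).as("d"),
  concat_ws(",", collect_list(col("s"))).as("s"),
  max(col("b")).as("b"),
  count(lit(1)).as("count")
)

val out = df.groupBy("id").agg(aggExprs.head, aggExprs.tail: _*)
out.orderBy("id").show()
```

The order of elements inside collect_list is not guaranteed, so the concatenated string for id 2 may come out as "b,c" or "c,b".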

How to assign keys to items in a column in Scala?

I have the following RDD:
Col1 Col2
"abc" "123a"
"def" "783b"
"abc" "674b"
"xyz" "123a"
"abc" "783b"
I need the following output where each item in each column is converted into a unique key.
for example : abc->1,def->2,xyz->3
Col1 Col2
1 1
2 2
1 3
3 1
1 2
Any help would be appreciated. Thanks!
In this case, you can rely on the hashCode of the string. The hash code is the same whenever the input string is the same, although two different strings can occasionally share a hash code. Try this:
scala> "abc".hashCode
res23: Int = 96354
scala> "xyz".hashCode
res24: Int = 119193
scala> val df = Seq(("abc","123a"),
| ("def","783b"),
| ("abc","674b"),
| ("xyz","123a"),
| ("abc","783b")).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala>
scala> def hashc(x:String):Int =
| return x.hashCode
hashc: (x: String)Int
scala> val myudf = udf(hashc(_:String):Int)
myudf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(StringType)))
scala> df.select(myudf('col1), myudf('col2)).show
+---------+---------+
|UDF(col1)|UDF(col2)|
+---------+---------+
| 96354| 1509487|
| 99333| 1694000|
| 96354| 1663279|
| 119193| 1509487|
| 96354| 1694000|
+---------+---------+
scala>
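One caveat with the hashCode approach is that hash codes are not unique keys. The plain-Scala sketch below (no Spark; firstSeenKeys is a hypothetical helper name) shows a known collision and, for data small enough to fit in a local collection, a dictionary that numbers values from 1 in order of first appearance:

```scala
// Two different strings that share a hash code (both evaluate to 2112).
val collide = "Aa".hashCode == "BB".hashCode

// Assign 1-based keys in order of first appearance.
// Assumption: the distinct values fit in a local collection.
def firstSeenKeys(values: Seq[String]): Map[String, Int] =
  values.distinct.zipWithIndex.map { case (v, i) => v -> (i + 1) }.toMap

val col1 = Seq("abc", "def", "abc", "xyz", "abc")
val col2 = Seq("123a", "783b", "674b", "123a", "783b")

val k1 = firstSeenKeys(col1)
val k2 = firstSeenKeys(col2)

// Encode each (col1, col2) pair with its keys.
val encoded = col1.zip(col2).map { case (a, b) => (k1(a), k2(b)) }
// encoded: List((1,1), (2,2), (1,3), (3,1), (1,2))
```

With the sample data this reproduces the key assignment asked for in the question (abc->1, def->2, xyz->3).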
If you must map your columns into natural numbers starting from 1, one approach would be to apply zipWithIndex to the individual columns, add 1 to the index (as zipWithIndex always starts from 0), convert the individual RDDs to DataFrames, and finally join the converted DataFrames to pick up the index keys:
val rdd = sc.parallelize(Seq(
("abc", "123a"),
("def", "783b"),
("abc", "674b"),
("xyz", "123a"),
("abc", "783b")
))
val df1 = rdd.map(_._1).distinct.zipWithIndex.
map(r => (r._1, r._2 + 1)).
toDF("col1", "c1key")
val df2 = rdd.map(_._2).distinct.zipWithIndex.
map(r => (r._1, r._2 + 1)).
toDF("col2", "c2key")
val dfJoined = rdd.toDF("col1", "col2").
join(df1, Seq("col1")).
join(df2, Seq("col2"))
// +----+----+-----+-----+
// |col2|col1|c1key|c2key|
// +----+----+-----+-----+
// |783b| abc| 2| 1|
// |783b| def| 3| 1|
// |123a| xyz| 1| 2|
// |123a| abc| 2| 2|
// |674b| abc| 2| 3|
// +----+----+-----+-----+
dfJoined.
select($"c1key".as("col1"), $"c2key".as("col2")).
show
// +----+----+
// |col1|col2|
// +----+----+
// | 2| 1|
// | 3| 1|
// | 1| 2|
// | 2| 2|
// | 2| 3|
// +----+----+
Note that if you're okay with having the keys start from 0, the step of map(r => (r._1, r._2 + 1)) can be skipped in generating df1 and df2.
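If consecutive keys in sorted order are acceptable, a window function can avoid the two joins. This is only a sketch, assuming a local SparkSession; note that a window with no partitioning moves all rows into a single partition, so it suits modest data sizes:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank}

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("dense-rank-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("abc", "123a"),
  ("def", "783b"),
  ("abc", "674b"),
  ("xyz", "123a"),
  ("abc", "783b")
).toDF("col1", "col2")

// dense_rank numbers the distinct values 1, 2, 3, ... in sorted order.
val keyed = df
  .withColumn("c1key", dense_rank().over(Window.orderBy(col("col1"))))
  .withColumn("c2key", dense_rank().over(Window.orderBy(col("col2"))))

keyed.show()
```

Here the keys follow alphabetical order (abc->1, def->2, xyz->3 and 123a->1, 674b->2, 783b->3) rather than order of first appearance.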

Spark Scala Split DataFrame by some value range

Suppose I have a dataframe with a column named x with a value range of [0, 1]. I hope to split it by the value of column x with ranges like [0, 0.1), [0.1, 0.2)...[0.9, 1]. Is there a good and fast way to do that? I'm using Spark 2 in Scala.
Update: Ideally there should be 10 new dataframes that contain data for each range.
Expanding on @Psidom's solution for creating ranges, here's one approach to create a dataframe for each range:
import org.apache.spark.sql.types.IntegerType
val df = Seq(0.2, 0.71, 0.95, 0.33, 0.28, 0.8, 0.73).toDF("x")
val df2 = df.withColumn("g", ($"x" * 10.0).cast(IntegerType))
df2.show
+----+---+
| x| g|
+----+---+
| 0.2| 2|
|0.71| 7|
|0.95| 9|
|0.33| 3|
|0.28| 2|
| 0.8| 8|
|0.73| 7|
+----+---+
val dfMap = df2.select($"g").distinct.
collect.
flatMap(_.toSeq).
map( g => g -> df2.where($"g" === g) ).
toMap
dfMap.getOrElse(3, null).show
+----+---+
| x| g|
+----+---+
|0.33| 3|
+----+---+
dfMap.getOrElse(7, null).show
+----+---+
| x| g|
+----+---+
|0.71| 7|
|0.73| 7|
+----+---+
[UPDATE]
If your ranges are irregular, you can define a function which maps a Double into the corresponding Int range id, then wrap it with a UDF, like in the following:
val g: Double => Int = x => x match {
case x if (x >= 0.0 && x < 0.12345) => 1
case x if (x >= 0.12345 && x < 0.4834) => 2
case x if (x >= 0.4834 && x < 1.0) => 3
case _ => 99 // catch-all
}
val groupUDF = udf(g)
val df = Seq(0.1, 0.2, 0.71, 0.95, 0.03, 0.09, 0.44, 5.0).toDF("x")
val df2 = df.withColumn("g", groupUDF($"x"))
df2.show
+----+---+
| x| g|
+----+---+
| 0.1| 1|
| 0.2| 2|
|0.71| 3|
|0.95| 3|
|0.03| 1|
|0.09| 1|
|0.44| 2|
| 5.0| 99|
+----+---+
If you mean to discretize a double-typed column, you can multiply the column by 10 and cast it to integer type, which cuts values in [0, 1) into 10 discrete bins:
import org.apache.spark.sql.types.IntegerType
val df = Seq(0.32, 0.5, 0.99, 0.72, 0.11, 0.03).toDF("A")
// df: org.apache.spark.sql.DataFrame = [A: double]
df.withColumn("new", ($"A" * 10).cast(IntegerType)).show
+----+---+
| A|new|
+----+---+
|0.32| 3|
| 0.5| 5|
|0.99| 9|
|0.72| 7|
|0.11| 1|
|0.03| 0|
+----+---+
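With the multiply-and-cast approach, x = 1.0 lands in an eleventh bin (10), whereas the question wants the last range to be [0.9, 1]. Below is a minimal sketch of one way to handle that, assuming a local SparkSession and hypothetical names binned and byBin: clamp the bin id to 9, then build one filtered DataFrame per bin.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, floor, least, lit}

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("bin-sketch").getOrCreate()
import spark.implicits._

val df = Seq(0.05, 0.33, 0.95, 1.0).toDF("x")

// floor(x * 10) gives 0..10 on [0, 1]; least(..., 9) folds x = 1.0 into the last bin.
val binned = df.withColumn("g", least(floor(col("x") * 10), lit(9L)).cast("int"))

// One lazily evaluated DataFrame per bin id 0..9; empty bins yield empty DataFrames.
val byBin: Map[Int, DataFrame] =
  (0 until 10).map(g => g -> binned.where(col("g") === g)).toMap

byBin(9).show()
```

Keying the map by Int over a fixed range means every bin id is present, so no null fallback is needed when looking one up.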

Fill Nan with mean of the row in Scala-Spark

I have an RDD with 6 columns, where the last 5 columns might contain NaNs. My intention is to replace each NaN with the average of the non-NaN values among the last 5 values of the same row. For instance, having this input:
1, 2, 3, 4, 5, 6
2, 2, 2, NaN, 4, 0
3, NaN, NaN, NaN, 6, 0
4, NaN, NaN, 4, 4, 0
The output should be:
1, 2, 3, 4, 5, 6
2, 2, 2, 2, 4, 0
3, 3, 3, 3, 6, 0
4, 3, 3, 4, 4, 0
I know how to fill those NaNs with the average value of the column transforming the RDD to DataFrame:
var aux1 = df.select(df.columns.map(c => mean(col(c))) :_*)
var aux2 = df.na.fill(/*get values of aux1*/)
My question is, how can you do this operation but instead of filling the NaN with the column average, fill it with an average of the values of a subgroup of the row?
You can do this by defining a function to get the mean, and another function to fill nulls in a row.
Given the DF you presented:
val df = sc.parallelize(List(
  (Some(1), Some(2), Some(3), Some(4), Some(5), Some(6)),
  (Some(2), Some(2), Some(2), None, Some(4), Some(0)),
  (Some(3), None, None, None, Some(6), Some(0)),
  (Some(4), None, None, Some(4), Some(4), Some(0))
)).toDF("a","b","c","d","e","f")
We need a function to get the mean of a Row:
import org.apache.spark.sql.Row
def rowMean(row: Row): Int = {
val nonNulls = (0 until row.length).map(i => (!row.isNullAt(i), row.getAs[Int](i))).filter(_._1).map(_._2).toList
nonNulls.sum / nonNulls.length
}
And another to fill nulls in a Row:
def rowFillNulls(row: Row, fill: Int): Row = {
Row((0 until row.length).map(i => if (row.isNullAt(i)) fill else row.getAs[Int](i)) : _*)
}
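Now we can first compute each row's mean (going through df.rdd, so this also works on Spark 2+, where df.map would require an Encoder):
val rowWithMean = df.rdd.map(row => (row, rowMean(row)))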
And then fill it:
val result = sqlContext.createDataFrame(rowWithMean.map{case (row,mean) => rowFillNulls(row,mean)}, df.schema)
Finally view before and after...
df.show
+---+----+----+----+---+---+
| a| b| c| d| e| f|
+---+----+----+----+---+---+
| 1| 2| 3| 4| 5| 6|
| 2| 2| 2|null| 4| 0|
| 3|null|null|null| 6| 0|
| 4|null|null| 4| 4| 0|
+---+----+----+----+---+---+
result.show
+---+---+---+---+---+---+
| a| b| c| d| e| f|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 5| 6|
| 2| 2| 2| 2| 4| 0|
| 3| 3| 3| 3| 6| 0|
| 4| 3| 3| 4| 4| 0|
+---+---+---+---+---+---+
This will work for any width DF with Int columns. You can easily update this to other datatypes, even non-numeric (hint, inspect the df schema!)
A bunch of imports:
import org.apache.spark.sql.functions.{col, isnan, isnull, lit, round, when}
import org.apache.spark.sql.Column
A few helper functions:
def nullOrNan(c: Column) = isnan(c) || isnull(c)
def rowMean(cols: Column*): Column = {
val sum = cols
.map(c => when(nullOrNan(c), lit(0.0)).otherwise(c))
.fold(lit(0.0))(_ + _)
val count = cols
.map(c => when(nullOrNan(c), lit(0.0)).otherwise(lit(1.0)))
.fold(lit(0.0))(_ + _)
sum / count
}
A solution:
val mean = round(
rowMean(df.columns.tail.map(col): _*)
).cast("int").alias("mean")
val exprs = df.columns.tail.map(
c => when(nullOrNan(col(c)), mean).otherwise(col(c)).alias(c)
)
val filled = df.select(col(df.columns(0)) +: exprs: _*)
Well, this is a fun little problem - I will post my solution, but I will definitely watch and see if someone comes up with a better way of doing it :)
First I would introduce a couple of udfs:
val avg = udf((values: Seq[Integer]) => {
val notNullValues = values.filter(_ != null).map(_.toInt)
notNullValues.sum/notNullValues.length
})
val replaceNullWithAvg = udf((x: Integer, avg: Integer) => if(x == null) avg else x)
which I would then apply to the DataFrame like this:
df
  .withColumn("avg", avg(array(df.columns.tail.map(s => df.col(s)): _*)))
  .select(
    'col1,
    replaceNullWithAvg('col2, 'avg) as "col2",
    replaceNullWithAvg('col3, 'avg) as "col3",
    replaceNullWithAvg('col4, 'avg) as "col4",
    replaceNullWithAvg('col5, 'avg) as "col5",
    replaceNullWithAvg('col6, 'avg) as "col6")
This will get you what you are looking for, but arguably not the most sophisticated code I have ever put together...
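The answers above treat the missing values as nulls. If the RDD really holds Double.NaN values, the per-row logic can be sketched in plain Scala as below. fillNaNWithRowMean is a hypothetical helper; it assumes the first element is an id that is excluded from the mean, and it rounds the mean to match the expected output in the question:

```scala
// Replace NaNs in the last columns with the rounded mean of the non-NaN values
// among those same columns; the first element is passed through unchanged.
def fillNaNWithRowMean(row: Seq[Double]): Seq[Double] = {
  val rest = row.tail
  val valid = rest.filterNot(_.isNaN)
  if (valid.isEmpty) row // nothing to average: leave the row as it is
  else {
    val mean = math.round(valid.sum / valid.size).toDouble
    row.head +: rest.map(v => if (v.isNaN) mean else v)
  }
}

val input = Seq(
  Seq(1.0, 2.0, 3.0, 4.0, 5.0, 6.0),
  Seq(2.0, 2.0, 2.0, Double.NaN, 4.0, 0.0),
  Seq(3.0, Double.NaN, Double.NaN, Double.NaN, 6.0, 0.0),
  Seq(4.0, Double.NaN, Double.NaN, 4.0, 4.0, 0.0)
)

val filledRows = input.map(fillNaNWithRowMean)
filledRows.foreach(r => println(r.mkString(", ")))
```

On an RDD[Seq[Double]] the same function could presumably be applied with rdd.map(fillNaNWithRowMean), since it has no Spark dependencies.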

How to combine (join) information across an Array[DataFrame]

I have an Array[DataFrame] and I want to check, for each row of each data frame, if there is any change in the values by column. Say I have the first row of three data frames, like:
(0,1.0,0.4,0.1)
(0,3.0,0.2,0.1)
(0,5.0,0.4,0.1)
The first column is the ID, and my ideal output for this ID would be:
(0, 1, 1, 0)
meaning that the second and third columns changed while the fourth did not.
I attach here a bit of data to replicate my setting
val rdd = sc.parallelize(Array((0,1.0,0.4,0.1),
(1,0.9,0.3,0.3),
(2,0.2,0.9,0.2),
(3,0.9,0.2,0.2),
(4,0.3,0.5,0.5)))
val rdd2 = sc.parallelize(Array((0,3.0,0.2,0.1),
(1,0.9,0.3,0.3),
(2,0.2,0.5,0.2),
(3,0.8,0.1,0.1),
(4,0.3,0.5,0.5)))
val rdd3 = sc.parallelize(Array((0,5.0,0.4,0.1),
(1,0.5,0.3,0.3),
(2,0.3,0.3,0.5),
(3,0.3,0.3,0.1),
(4,0.3,0.5,0.5)))
val df = rdd.toDF("id", "prop1", "prop2", "prop3")
val df2 = rdd2.toDF("id", "prop1", "prop2", "prop3")
val df3 = rdd3.toDF("id", "prop1", "prop2", "prop3")
val result:Array[DataFrame] = new Array[DataFrame](3)
result.update(0, df)
result.update(1,df2)
result.update(2,df3)
How can I map over the array and get my output?
You can use countDistinct with groupBy:
import org.apache.spark.sql.functions.{countDistinct}
val exprs = Seq("prop1", "prop2", "prop3")
.map(c => (countDistinct(c) > 1).cast("integer").alias(c))
val combined = result.reduce(_ unionAll _)
val aggregatedViaGroupBy = combined
.groupBy($"id")
.agg(exprs.head, exprs.tail: _*)
aggregatedViaGroupBy.show
// +---+-----+-----+-----+
// | id|prop1|prop2|prop3|
// +---+-----+-----+-----+
// | 0| 1| 1| 0|
// | 1| 1| 0| 0|
// | 2| 1| 1| 1|
// | 3| 1| 1| 1|
// | 4| 0| 0| 0|
// +---+-----+-----+-----+
First we need to join all the DataFrames together.
val combined = result.reduceLeft((a,b) => a.join(b,"id"))
To compare all the columns of the same label (e.g., "prop1"), I found it easier (at least for me) to operate on the RDD level. We first transform the data into (id, Seq[Double]).
val finalResults = combined.rdd.map{
x =>
(x.getInt(0), x.toSeq.tail.map(_.asInstanceOf[Double]))
}.map{
case(i,d) =>
def checkAllEqual(l: Seq[Double]) = if(l.toSet.size == 1) 0 else 1
val g = d.grouped(3).toList
val g1 = checkAllEqual(g.map(x => x(0)))
val g2 = checkAllEqual(g.map(x => x(1)))
val g3 = checkAllEqual(g.map(x => x(2)))
(i, g1,g2,g3)
}.toDF("id", "prod1", "prod2", "prod3")
finalResults.show()
This will print:
+---+-----+-----+-----+
| id|prod1|prod2|prod3|
+---+-----+-----+-----+
| 0| 1| 1| 0|
| 1| 1| 0| 0|
| 2| 1| 1| 1|
| 3| 1| 1| 1|
| 4| 0| 0| 0|
+---+-----+-----+-----+
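The per-id comparison in the RDD answer hard-codes three property columns. As a plain-Scala sketch (changedFlags is a hypothetical helper, not part of either answer), transpose generalizes the same check to any number of columns:

```scala
// Given the property values of one id, one inner Seq per DataFrame,
// return 1 for each column whose values differ across DataFrames and 0 otherwise.
def changedFlags(rows: Seq[Seq[Double]]): Seq[Int] =
  rows.transpose.map(column => if (column.distinct.size > 1) 1 else 0)

// Values for id 0 taken from the three sample DataFrames in the question.
val id0 = Seq(
  Seq(1.0, 0.4, 0.1),
  Seq(3.0, 0.2, 0.1),
  Seq(5.0, 0.4, 0.1)
)

val flags0 = changedFlags(id0)
// flags0: List(1, 1, 0)
```

Inside the map over combined.rdd, d.grouped(3).toList has the same shape as the rows argument here, so the three hand-written checkAllEqual calls could be replaced by a single call.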