How to unpivot Spark DataFrame without hardcoding column names in Scala?

Assume you have
val df = Seq(("Jack", 91, 86), ("Mike", 79, 85), ("Julia", 93, 70)).toDF("Name", "Maths", "Art")
which gives:
+-----+-----+---+
| Name|Maths|Art|
+-----+-----+---+
| Jack| 91| 86|
| Mike| 79| 85|
|Julia| 93| 70|
+-----+-----+---+
Now you want to unpivot it by:
df.select($"Name", expr("stack(2, 'Maths', Maths, 'Art', Art) as (Subject, Score)"))
which gives:
+-----+-------+-----+
| Name|Subject|Score|
+-----+-------+-----+
| Jack| Maths| 91|
| Jack| Art| 86|
| Mike| Maths| 79|
| Mike| Art| 85|
|Julia| Maths| 93|
|Julia| Art| 70|
+-----+-------+-----+
So far so good! Now, what if you don't know the list of column names? What if the list of column names is long, or it can change? How can we avoid hardcoding the column names like that?
Even something like this would also be fine:
// fake code
df.select($"Name", unpivot(df.columns.diff("Name")) as ("Subject", "Score"))
Why don't we have an API like this?

This works quite well indeed:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}
def melt(preserves: Seq[String], toMelt: Seq[String], column: String = "variable", row: String = "value", df: DataFrame): DataFrame = {
  // one struct per melted column, holding (column name, column value)
  val _vars_and_vals = array((for (c <- toMelt) yield { struct(lit(c).alias(column), col(c).alias(row)) }): _*)
  // explode the array so every (name, value) pair becomes its own row
  val _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))
  // keep the preserved columns and pull the pair back out as two named columns
  val cols = preserves.map(col _) ++ { for (x <- List(column, row)) yield { col("_vars_and_vals")(x).alias(x) }}
  _tmp.select(cols: _*)
}
Source: How to melt Spark DataFrame?
Thanks to @user10938362
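A quick usage sketch with the sample DataFrame from the question, deriving the melted columns from df.columns instead of hardcoding them (the "Subject"/"Score" names are just for illustration):
// melt every column except "Name"; should reproduce the stack() output above
melt(Seq("Name"), df.columns.filter(_ != "Name").toSeq, "Subject", "Score", df).show()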

By using .mkString with its three-argument form (start, separator, end) we can build the stack expression as a string and use it in expr.
Example:
df.show()
//+-----+-----+---+
//| Name|Maths|Art|
//+-----+-----+---+
//| Jack| 91| 86|
//| Mike| 79| 85|
//|Julia| 93| 70|
//+-----+-----+---+
//filtering required cols
val cols=df.columns.filter(_.toLowerCase != "name")
//defining alias cols string
val alias_cols="Subject,Score"
//mkString with start, separator and end arguments
val stack_exp=cols.map(x => s"""'${x}',${x}""").mkString(s"stack(${cols.size},",",",s""") as (${alias_cols})""")
df.select($"Name", expr(s"""${stack_exp}""")).show()
//+-----+-------+-----+
//| Name|Subject|Score|
//+-----+-------+-----+
//| Jack| Maths| 91|
//| Jack| Art| 86|
//| Mike| Maths| 79|
//| Mike| Art| 85|
//|Julia| Maths| 93|
//|Julia| Art| 70|
//+-----+-------+-----+
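For what it's worth, newer Spark versions do ship an API much like the one the question wishes for: as far as I know, Spark 3.4+ has Dataset.unpivot (also aliased as melt). On those versions something like this should work without listing the value columns at all:
// assumes Spark 3.4 or later
import org.apache.spark.sql.functions.col
df.unpivot(Array(col("Name")), "Subject", "Score").show()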

Related

How can I add a sequence of strings as columns on a DataFrame and make it a transform

I have a sequence of strings
val listOfString : Seq[String] = Seq("a","b","c")
How can I make a transform like
def addColumn(example: Seq[String]): DataFrame => DataFrame = {
  // some code which returns a transform that adds these strings as columns to the dataframe
}
input
+---+
| id|
+---+
|  1|
+---+
output
+---+---+---+---+
| id|  a|  b|  c|
+---+---+---+---+
|  1|  0|  0|  0|
+---+---+---+---+
I am only interested in making it a transform.
You can use the transform method of the datasets together with a single select statement:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}
def addColumns(extraCols: Seq[String])(df: DataFrame): DataFrame = {
  val selectCols = df.columns.map(col(_)) ++ extraCols.map(c => lit(0).as(c))
  df.select(selectCols: _*)
}
// usage example
val yourExtraColumns : Seq[String] = Seq("a","b","c")
df.transform(addColumns(yourExtraColumns))
Resources
https://towardsdatascience.com/dataframe-transform-spark-function-composition-eb8ec296c108
https://mungingdata.com/apache-spark/chaining-custom-dataframe-transformations/
Use .toDF() and pass your listOfString.
Example:
//sample dataframe
df.show()
//+---+---+---+
//| _1| _2| _3|
//+---+---+---+
//| 0| 0| 0|
//+---+---+---+
df.toDF(listOfString:_*).show()
//+---+---+---+
//| a| b| c|
//+---+---+---+
//| 0| 0| 0|
//+---+---+---+
UPDATE:
Use foldLeft to add the columns to the existing dataframe with values.
val df=Seq(("1")).toDF("id")
val listOfString : Seq[String] = Seq("a","b","c")
val new_df=listOfString.foldLeft(df){(df,colName) => df.withColumn(colName,lit("0"))}
//+---+---+---+---+
//| id| a| b| c|
//+---+---+---+---+
//| 1| 0| 0| 0|
//+---+---+---+---+
//or creating a function
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
def addColumns(extraCols: Seq[String], df: DataFrame): DataFrame =
  extraCols.foldLeft(df) { (df, colName) => df.withColumn(colName, lit("0")) }
addColumns(listOfString,df).show()
//+---+---+---+---+
//| id| a| b| c|
//+---+---+---+---+
//| 1| 0| 0| 0|
//+---+---+---+---+
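Since the question explicitly asks for a transform, the same foldLeft can be curried so it plugs straight into df.transform (a small sketch; addColumnsT is just an illustrative name):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
def addColumnsT(extraCols: Seq[String])(df: DataFrame): DataFrame =
  extraCols.foldLeft(df) { (acc, colName) => acc.withColumn(colName, lit("0")) }
// same output as above
df.transform(addColumnsT(listOfString)).show()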

Infinity value in a dataframe spark / scala

I have a dataframe with infinity values. How can I replace them with 0.0?
I tried this, but it doesn't work:
val Nan= dataframe_final.withColumn("Vitesse",when(col("Vitesse").isin(Double.NaN,Double.PositiveInfinity,Double.NegativeInfinity),0.0))
Example of dataframe
--------------------
|           Vitesse|
--------------------
| 8.171069002316942|
|          Infinity|
| 4.290418664272539|
| 16.19811830014666|
|                  |
--------------------
How can I replace "Infinity" with 0.0?
Thank you.
scala> df.withColumn("Vitesse", when(col("Vitesse").equalTo(Double.PositiveInfinity),0.0).otherwise(col("Vitesse")))
res1: org.apache.spark.sql.DataFrame = [Vitesse: double]
scala> res1.show
+-----------------+
| Vitesse|
+-----------------+
|8.171069002316942|
| 0.0|
|4.290418664272539|
+-----------------+
You can try it like above.
Your approach using when().otherwise() is correct;
add the missing otherwise clause so that Vitesse values are kept as-is when the value is not Infinity, -Infinity, or NaN.
Example:
val df=Seq(("8.171".toDouble),("4.2904".toDouble),("16.19".toDouble),(Double.PositiveInfinity),(Double.NegativeInfinity),(Double.NaN)).toDF("Vitesse")
df.show()
//+---------+
//| Vitesse|
//+---------+
//| 8.171|
//| 4.2904|
//| 16.19|
//| Infinity|
//|-Infinity|
//| NaN|
//+---------+
df.withColumn("Vitesse", when(col("Vitesse").isin(Double.PositiveInfinity,Double.NegativeInfinity,Double.NaN),0.0).
otherwise(col("Vitesse"))).
show()
//+-------+
//|Vitesse|
//+-------+
//| 8.171|
//| 4.2904|
//| 16.19|
//| 0.0|
//| 0.0|
//| 0.0|
//+-------+
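If several double columns can contain these values, the same when().otherwise() fix can be folded over a list of column names (a sketch; numericCols is a hypothetical list you would supply, and all of its columns are assumed to be doubles):
import org.apache.spark.sql.functions.{col, when}
val numericCols = Seq("Vitesse") // hypothetical: the double columns to clean
val cleaned = numericCols.foldLeft(df) { (acc, c) =>
  acc.withColumn(c,
    when(col(c).isin(Double.PositiveInfinity, Double.NegativeInfinity, Double.NaN), 0.0)
      .otherwise(col(c)))
}
cleaned.show()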

Updating a column of a dataframe based on another column value

I'm trying to update the value of a column using another column's value in Scala.
This is the data in my dataframe :
+-------------------+------+------+-----+------+----+--------------------+-----------+
|UniqueRowIdentifier| _c0| _c1| _c2| _c3| _c4| _c5|isBadRecord|
+-------------------+------+------+-----+------+----+--------------------+-----------+
| 1| 0| 0| Name| 0|Desc| | 0|
| 2| 2.11| 10000|Juice| 0| XYZ|2016/12/31 : Inco...| 0|
| 3|-0.500|-24.12|Fruit| -255| ABC| 1994-11-21 00:00:00| 0|
| 4| 0.087| 1222|Bread|-22.06| | 2017-02-14 00:00:00| 0|
| 5| 0.087| 1222|Bread|-22.06| | | 0|
+-------------------+------+------+-----+------+----+--------------------+-----------+
Here column _c5 contains a value which is incorrect (the value in Row 2 contains the string Incorrect), based on which I'd like to update its isBadRecord field to 1.
Is there a way to update this field?
You can use the withColumn API and one of the functions that meets your needs to fill in 1 for bad records.
For your case you can write a udf function:
def fillbad = udf((c5 : String) => if(c5.contains("Incorrect")) 1 else 0)
And call it as
val newDF = dataframe.withColumn("isBadRecord", fillbad(dataframe("_c5")))
Instead of reasoning about updating it, I'll suggest you think about it as you would in SQL; you could do the following:
import org.apache.spark.sql.functions.when
val spark: SparkSession = ??? // your spark session
val df: DataFrame = ??? // your dataframe
import spark.implicits._
df.select(
$"UniqueRowIdentifier", $"_c0", $"_c1", $"_c2", $"_c3", $"_c4",
$"_c5", when($"_c5".contains("Incorrect"), 1).otherwise(0) as "isBadRecord")
Here is a self-contained script that you can copy and paste on your Spark shell to see the result locally:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
sc.setLogLevel("ERROR")
val schema =
StructType(Seq(
StructField("UniqueRowIdentifier", IntegerType),
StructField("_c0", DoubleType),
StructField("_c1", DoubleType),
StructField("_c2", StringType),
StructField("_c3", DoubleType),
StructField("_c4", StringType),
StructField("_c5", StringType),
StructField("isBadRecord", IntegerType)))
val contents =
Seq(
Row(1, 0.0 , 0.0 , "Name", 0.0, "Desc", "", 0),
Row(2, 2.11 , 10000.0 , "Juice", 0.0, "XYZ", "2016/12/31 : Incorrect", 0),
Row(3, -0.5 , -24.12, "Fruit", -255.0, "ABC", "1994-11-21 00:00:00", 0),
Row(4, 0.087, 1222.0 , "Bread", -22.06, "", "2017-02-14 00:00:00", 0),
Row(5, 0.087, 1222.0 , "Bread", -22.06, "", "", 0)
)
val df = spark.createDataFrame(sc.parallelize(contents), schema)
df.show()
val withBadRecords =
df.select(
$"UniqueRowIdentifier", $"_c0", $"_c1", $"_c2", $"_c3", $"_c4",
$"_c5", when($"_c5".contains("Incorrect"), 1).otherwise(0) as "isBadRecord")
withBadRecords.show()
The relevant output is the following:
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
|UniqueRowIdentifier| _c0| _c1| _c2| _c3| _c4| _c5|isBadRecord|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
| 1| 0.0| 0.0| Name| 0.0|Desc| | 0|
| 2| 2.11|10000.0|Juice| 0.0| XYZ|2016/12/31 : Inco...| 0|
| 3| -0.5| -24.12|Fruit|-255.0| ABC| 1994-11-21 00:00:00| 0|
| 4|0.087| 1222.0|Bread|-22.06| | 2017-02-14 00:00:00| 0|
| 5|0.087| 1222.0|Bread|-22.06| | | 0|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
|UniqueRowIdentifier| _c0| _c1| _c2| _c3| _c4| _c5|isBadRecord|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
| 1| 0.0| 0.0| Name| 0.0|Desc| | 0|
| 2| 2.11|10000.0|Juice| 0.0| XYZ|2016/12/31 : Inco...| 1|
| 3| -0.5| -24.12|Fruit|-255.0| ABC| 1994-11-21 00:00:00| 0|
| 4|0.087| 1222.0|Bread|-22.06| | 2017-02-14 00:00:00| 0|
| 5|0.087| 1222.0|Bread|-22.06| | | 0|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
The best option is to create a UDF that tries to convert the value to a date.
If it can be converted then return 0, else return 1.
This works even if you have a bad date format:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().master("local")
  .appName("test").getOrCreate()
import spark.implicits._
//create test dataframe
val data = spark.sparkContext.parallelize(Seq(
(1,"1994-11-21 Xyz"),
(2,"1994-11-21 00:00:00"),
(3,"1994-11-21 00:00:00")
)).toDF("id", "date")
// create a udf which tries to parse the value as a date
// returns 0 on success (good record) and 1 on failure (bad record)
import java.text.SimpleDateFormat
import scala.util.{Try, Success, Failure}
import org.apache.spark.sql.functions.udf
val check = udf((value: String) => {
  Try(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(value)) match {
    case Success(d) => 0
    case Failure(e) => 1
  }
})
// Add column
data.withColumn("badData", check($"date")).show
Hope this helps!

Combining RDD's with some values missing

Hi, I have two RDDs that I want to combine into one.
The first RDD is of the format
//((UserID,MovID),Rating)
val predictions =
model.predict(user_mov).map { case Rating(user, mov, rate) =>
((user, mov), rate)
}
I have another RDD
//((UserID,MovID),"NA")
val user_mov_rat=user_mov.map(x=>(x,"N/A"))
The second RDD has more keys, but they overlap with RDD1. I need to combine the RDDs so that only those keys of the 2nd RDD which are not already in RDD1 are appended to RDD1.
You can do something like this -
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
// Setting up the rdds as described in the question
case class UserRating(user: String, mov: String, rate: Int = -1)
val list1 = List(UserRating("U1", "M1", 1),UserRating("U2", "M2", 3),UserRating("U3", "M1", 3),UserRating("U3", "M2", 1),UserRating("U4", "M2", 2))
val list2 = List(UserRating("U1", "M1"),UserRating("U5", "M4", 3),UserRating("U6", "M6"),UserRating("U3", "M2"), UserRating("U4", "M2"), UserRating("U4", "M3", 5))
val rdd1 = sc.parallelize(list1)
val rdd2 = sc.parallelize(list2)
// Convert to Dataframe so it is easier to handle
val df1 = rdd1.toDF
val df2 = rdd2.toDF
// What we got:
df1.show
+----+---+----+
|user|mov|rate|
+----+---+----+
| U1| M1| 1|
| U2| M2| 3|
| U3| M1| 3|
| U3| M2| 1|
| U4| M2| 2|
+----+---+----+
df2.show
+----+---+----+
|user|mov|rate|
+----+---+----+
| U1| M1| -1|
| U5| M4| 3|
| U6| M6| -1|
| U3| M2| -1|
| U4| M2| -1|
| U4| M3| 5|
+----+---+----+
// Figure out the extra reviews in second dataframe that do not match (user, mov) in first
val xtraReviews = df2.join(df1.withColumnRenamed("rate", "rate1"), Seq("user", "mov"), "left_outer").where("rate1 is null")
// Union them. Be careful because of this: http://stackoverflow.com/questions/32705056/what-is-going-wrong-with-unionall-of-spark-dataframe
def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
val columns = a.columns.toSet.intersect(b.columns.toSet).map(col).toSeq
a.select(columns: _*).union(b.select(columns: _*))
}
// Final result of combining only unique values in df2
unionByName(df1, xtraReviews).show
+----+---+----+
|user|mov|rate|
+----+---+----+
| U1| M1| 1|
| U2| M2| 3|
| U3| M1| 3|
| U3| M2| 1|
| U4| M2| 2|
| U5| M4| 3|
| U4| M3| 5|
| U6| M6| -1|
+----+---+----+
It might also be possible to do it this way:
RDDs are slow for this, so read your data as, or convert it to, DataFrames.
Use Spark's dropDuplicates() on both DataFrames, e.g. df.dropDuplicates(Seq("Key1", "Key2")), to get distinct values on the keys in each of your DataFrames, and then
simply union them with df1.union(df2).
The benefit is that you are doing it the Spark way, so you get all the parallelism and speed.
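A rough sketch of that idea using the df1 and df2 built above (note that dropDuplicates keeps an arbitrary row per key, so which rating survives for overlapping (user, mov) pairs is not guaranteed):
// union the two DataFrames, then deduplicate on the key columns
val combined = df1.union(df2).dropDuplicates(Seq("user", "mov"))
combined.show()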

Dataframe.map needs to produce more rows than are in the dataset

I am using Scala and Spark and have a simple dataframe.map to produce the required transformation on the data. However, I need to output an additional row of data alongside the modified original. How can I use dataframe.map to do this?
ex:
dataset from:
id, name, age
1, john, 23
2, peter, 32
if age < 25 default to 25.
dataset to:
id, name, age
1, john, 25
1, john, -23
2, peter, 32
Would a 'UnionAll' handle it?
eg.
df1 = original dataframe
df2 = transformed df1
df1.unionAll(df2)
EDIT: implementation using unionAll()
val df1=sqlContext.createDataFrame(Seq( (1,"john",23) , (2,"peter",32) )).
toDF( "id","name","age")
def udfTransform= udf[Int,Int] { (age) => if (age<25) 25 else age }
val df2=df1.withColumn("age2", udfTransform($"age")).
where("age!=age2").
drop("age2")
df1.withColumn("age", udfTransform($"age")).
unionAll(df2).
orderBy("id").
show()
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1| john| 25|
| 1| john| 23|
| 2|peter| 32|
+---+-----+---+
Note: the implementation differs a bit from the originally proposed (naive) solution. The devil is always in the detail!
EDIT 2: implementation using nested array and explode
val df1=sx.createDataFrame(Seq( (1,"john",23) , (2,"peter",32) )).
toDF( "id","name","age")
def udfArr= udf[Array[Int],Int] { (age) =>
if (age<25) Array(age,25) else Array(age) }
val df2=df1.withColumn("age", udfArr($"age"))
df2.show()
+---+-----+--------+
| id| name| age|
+---+-----+--------+
| 1| john|[23, 25]|
| 2|peter| [32]|
+---+-----+--------+
df2.withColumn("age",explode($"age") ).show()
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1| john| 23|
| 1| john| 25|
| 2|peter| 32|
+---+-----+---+