How can I add sequence of string as column on dataFrame and make as transforms - scala

I have a sequence of string
val listOfString : Seq[String] = Seq("a","b","c")
How can I make a transform like
def addColumn(example: Seq[String]): DataFrame => DataFrame {
some code which returns a transform which add these String as column to dataframe
}
input
+-------
| id
+-------
| 1
+-------
output
+-------+-------+----+-------
| id | a | b | c
+-------+-------+----+-------
| 1 | 0 | 0 | 0
+-------+-------+----+-------
I am only interested in making it as transform

You can use the transform method of the datasets together with a single select statement:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
def addColumns(extraCols: Seq[String])(df: DataFrame): DataFrame = {
val selectCols = df.columns.map{col(_)} ++ extraCols.map{c => lit(0).as(c)}
df.select(selectCols :_*)
}
// usage example
val yourExtraColumns : Seq[String] = Seq("a","b","c")
df.transform(addColumns(yourExtraColumns))
Resources
https://towardsdatascience.com/dataframe-transform-spark-function-composition-eb8ec296c108
https://mungingdata.com/apache-spark/chaining-custom-dataframe-transformations/

Use .toDF() and pass your listOfString.
Example:
//sample dataframe
df.show()
//+---+---+---+
//| _1| _2| _3|
//+---+---+---+
//| 0| 0| 0|
//+---+---+---+
df.toDF(listOfString:_*).show()
//+---+---+---+
//| a| b| c|
//+---+---+---+
//| 0| 0| 0|
//+---+---+---+
UPDATE:
Use foldLeft to add the columns to the existing dataframe with values.
val df=Seq(("1")).toDF("id")
val listOfString : Seq[String] = Seq("a","b","c")
val new_df=listOfString.foldLeft(df){(df,colName) => df.withColumn(colName,lit("0"))}
//+---+---+---+---+
//| id| a| b| c|
//+---+---+---+---+
//| 1| 0| 0| 0|
//+---+---+---+---+
//or creating a function
import org.apache.spark.sql.DataFrame
def addColumns(extraCols: Seq[String],df: DataFrame): DataFrame = {
val new_df=extraCols.foldLeft(df){(df,colName) => df.withColumn(colName,lit("0"))}
return new_df
}
addColumns(listOfString,df).show()
//+---+---+---+---+
//| id| a| b| c|
//+---+---+---+---+
//| 1| 0| 0| 0|
//+---+---+---+---+

Related

Selecting specific rows from different dataframes within a map scope

Hello I am new to Spark and scala, and I have three similar dataframes as the following:
df1:
+--------+-------+-------+-------+
| Country|1/22/20|1/23/20|1/24/20|
+--------+-------+-------+-------+
| Chad| 1| 0| 5|
+--------+-------+-------+-------+
|Paraguay| 4| 6| 3|
+--------+-------+-------+-------+
| Russia| 0| 0| 1|
+--------+-------+-------+-------+
df2 and d3 are exactly similar just with different values
I would like to apply a function to each row of df1 but I also need to select the same row (using the Country as key) from the other two dataframes because I need the selected rows as input arguments for the function I want to apply.
I thought of using
df1.map{ r =>
val selectedRowDf2 = selectRow using r at column "Country" ...
val selectedRowDf3 = selectRow using r at column "Country" ...
r.apply(functionToApply(r, selectedRowDf2, selectedRowDf3)
}
I also tried with map but I get an error as follows:
Error:(238, 23) not enough arguments for method map: (implicit evidence$6: org.apache.spark.sql.Encoder[Unit])org.apache.spark.sql.Dataset[Unit].
Unspecified value parameter evidence$6.
df1.map{
A possible approach could be to append each dataframe columns with a key to uniquely identify the columns and finally merge all the dataframe to a single dataframe using country column. The desired operation could be performed on each row of the merged datafarme.
def appendColWithKey(df: DataFrame, key: String) = {
var newdf = df
df.schema.foreach(s => {
newdf = newdf.withColumnRenamed(s.name, s"$key${s.name}")
})
newdf
}
val kdf1 = appendColWithKey(df1, "key1_")
val kdf2 = appendColWithKey(df2, "key2_")
val kdf3 = appendColWithKey(df3, "key3_")
val tempdf1 = kdf1.join(kdf2, col("key1_country") === col("key2_country"))
val tempdf = tempdf1.join(kdf3, col("key1_country") === col("key3_country"))
val finaldf = tempdf
.drop("key2_country")
.drop("key3_country")
.withColumnRenamed("key1_country", "country")
finaldf.show(10)
//Output
+--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| country|key1_1/22/20|key1_1/23/20|key1_1/24/20|key2_1/22/20|key2_1/23/20|key2_1/24/20|key3_1/22/20|key3_1/23/20|key3_1/24/20|
+--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| Chad| 1| 0| 5| 1| 0| 5| 1| 0| 5|
|Paraguay| 4| 6| 3| 4| 6| 3| 4| 6| 3|
| Russia| 0| 0| 1| 0| 0| 1| 0| 0| 1|
+--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+

Scala Adding many column in dataFrame

private def example(channelRollups: Seq[String]): Seq[(String, String)]={
seqOfElement.flatMap(first => seqOfElement.
filter(second => first.toLowerCase() < second.toLowerCase()).
map((first, _)))
}
def addProductCol(df: DataFrame, cols: (String, String)): DataFrame = {
df.withColumn(s"${cols._1}_${cols._2}", df(cols._1) * df(cols._2))
}
def addAllProductCols(df: DataFrame): DataFrame = {
colPairs = example(exampleSeq)
colPairs.foldLeft(df)(addProductCol)
}
Output
colPairs = Seq(("left", "right"), ("middle", "right"))
+----+------+-----+
|left|middle|right|
+----+------+-----+
| 2| 3| 5|
| 7| 11| 13|
| 17| 19| 23|
+----+------+-----+
+----+------+-----+----------+------------+
|left|middle|right|left_right|middle_right|
+----+------+-----+----------+------------+
| 2| 3| 5| 10| 15|
| 7| 11| 13| 91| 143|
| 17| 19| 23| 391| 437|
+----+------+-----+----------+------------+
Instead of directly adding column into DataFrame , Can we modify into DataFrame => DataFrame or we can say a transform??
def example2 : DataFrame => DataFrame = {
dataFrame => {
dataFrame.
withColumn(Column, when(col(ORDER_DATETIME_KEY).isNotNull, 1).otherwise(0))
}
}
So basically I want to make the first example like the exmaple2 in which I will return dataType as DataFrame => DataFrame not as DataFrame

Converting row into column using spark scala

I want to covert row into column using spark dataframe.
My table is like this
Eno,Name
1,A
1,B
1,C
2,D
2,E
I want to convert it into
Eno,n1,n2,n3
1,A,B,C
2,D,E,Null
I used this below code :-
val r = spark.sqlContext.read.format("csv").option("header","true").option("inferschema","true").load("C:\\Users\\axy\\Desktop\\abc2.csv")
val n =Seq("n1","n2","n3"
r
.groupBy("Eno")
.pivot("Name",n).agg(expr("coalesce(first(Name),3)").cast("double")).show()
But I am getting result as-->
+---+----+----+----+
|Eno| n1| n2| n3|
+---+----+----+----+
| 1|null|null|null|
| 2|null|null|null|
+---+----+----+----+
Can anyone help to get the desire result.
val m= map(lit("A"), lit("n1"), lit("B"),lit("n2"), lit("C"), lit("n3"), lit("D"), lit("n1"), lit("E"), lit("n2"))
val df= Seq((1,"A"),(1,"B"),(1,"C"),(2,"D"),(2,"E")).toDF("Eno","Name")
df.withColumn("new", m($"Name")).groupBy("Eno").pivot("new").agg(first("Name"))
+---+---+---+----+
|Eno| n1| n2| n3|
+---+---+---+----+
| 1| A| B| C|
| 2| D| E|null|
+---+---+---+----+
import org.apache.spark.sql.functions._
import spark.implicits._
val df= Seq((1,"A"),(1,"B"),(1,"C"),(2,"D"),(2,"E")).toDF("Eno","Name")
val getName=udf {(names: Seq[String],i : Int) => if (names.size>i) names(i) else null}
val tdf=df.groupBy($"Eno").agg(collect_list($"name").as("names"))
val ndf=(0 to 2).foldLeft(tdf){(ndf,i) => ndf.withColumn(s"n${i}",getName($"names",lit(i))) }.
drop("names")
ndf.show()
+---+---+---+----+
|Eno| n0| n1| n2|
+---+---+---+----+
| 1| A| B| C|
| 2| D| E|null|
+---+---+---+----+

Combining RDD's with some values missing

Hi I have two RDD's I want to combine into 1.
The first RDD is of the format
//((UserID,MovID),Rating)
val predictions =
model.predict(user_mov).map { case Rating(user, mov, rate) =>
((user, mov), rate)
}
I have another RDD
//((UserID,MovID),"NA")
val user_mov_rat=user_mov.map(x=>(x,"N/A"))
So the keys in the second RDD are more in no. but overlap with RDD1. I need to combine the RDD's so that only those keys of 2nd RDD append to RDD1 which are not there in RDD1.
You can do something like this -
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
// Setting up the rdds as described in the question
case class UserRating(user: String, mov: String, rate: Int = -1)
val list1 = List(UserRating("U1", "M1", 1),UserRating("U2", "M2", 3),UserRating("U3", "M1", 3),UserRating("U3", "M2", 1),UserRating("U4", "M2", 2))
val list2 = List(UserRating("U1", "M1"),UserRating("U5", "M4", 3),UserRating("U6", "M6"),UserRating("U3", "M2"), UserRating("U4", "M2"), UserRating("U4", "M3", 5))
val rdd1 = sc.parallelize(list1)
val rdd2 = sc.parallelize(list2)
// Convert to Dataframe so it is easier to handle
val df1 = rdd1.toDF
val df2 = rdd2.toDF
// What we got:
df1.show
+----+---+----+
|user|mov|rate|
+----+---+----+
| U1| M1| 1|
| U2| M2| 3|
| U3| M1| 3|
| U3| M2| 1|
| U4| M2| 2|
+----+---+----+
df2.show
+----+---+----+
|user|mov|rate|
+----+---+----+
| U1| M1| -1|
| U5| M4| 3|
| U6| M6| -1|
| U3| M2| -1|
| U4| M2| -1|
| U4| M3| 5|
+----+---+----+
// Figure out the extra reviews in second dataframe that do not match (user, mov) in first
val xtraReviews = df2.join(df1.withColumnRenamed("rate", "rate1"), Seq("user", "mov"), "left_outer").where("rate1 is null")
// Union them. Be careful because of this: http://stackoverflow.com/questions/32705056/what-is-going-wrong-with-unionall-of-spark-dataframe
def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
val columns = a.columns.toSet.intersect(b.columns.toSet).map(col).toSeq
a.select(columns: _*).union(b.select(columns: _*))
}
// Final result of combining only unique values in df2
unionByName(df1, xtraReviews).show
+----+---+----+
|user|mov|rate|
+----+---+----+
| U1| M1| 1|
| U2| M2| 3|
| U3| M1| 3|
| U3| M2| 1|
| U4| M2| 2|
| U5| M4| 3|
| U4| M3| 5|
| U6| M6| -1|
+----+---+----+
It might also be possible to do it in this way:
RDD's are really slow, so read your data or convert your data in dataframes.
Use spark dropDuplicates() on both the dataframes like df.dropDuplicates(['Key1', 'Key2']) to get distinct values on keys in both of your dataframe and then
simply union them like df1.union(df2).
Benefit is you are doing it in spark way and hence you have all the parallelism and speed.

How to combine (join) information across an Array[DataFrame]

I have an Array[DataFrame] and I want to check, for each row of each data frame, if there is any change in the values by column. Say I have the first row of three data frames, like:
(0,1.0,0.4,0.1)
(0,3.0,0.2,0.1)
(0,5.0,0.4,0.1)
The first column is the ID, and my ideal output for this ID would be:
(0, 1, 1, 0)
meaning that the second and third columns changed while the third did not.
I attach here a bit of data to replicate my setting
val rdd = sc.parallelize(Array((0,1.0,0.4,0.1),
(1,0.9,0.3,0.3),
(2,0.2,0.9,0.2),
(3,0.9,0.2,0.2),
(4,0.3,0.5,0.5)))
val rdd2 = sc.parallelize(Array((0,3.0,0.2,0.1),
(1,0.9,0.3,0.3),
(2,0.2,0.5,0.2),
(3,0.8,0.1,0.1),
(4,0.3,0.5,0.5)))
val rdd3 = sc.parallelize(Array((0,5.0,0.4,0.1),
(1,0.5,0.3,0.3),
(2,0.3,0.3,0.5),
(3,0.3,0.3,0.1),
(4,0.3,0.5,0.5)))
val df = rdd.toDF("id", "prop1", "prop2", "prop3")
val df2 = rdd2.toDF("id", "prop1", "prop2", "prop3")
val df3 = rdd3.toDF("id", "prop1", "prop2", "prop3")
val result:Array[DataFrame] = new Array[DataFrame](3)
result.update(0, df)
result.update(1,df2)
result.update(2,df3)
How can I map over the array and get my output?
You can use countDistinct with groupBy:
import org.apache.spark.sql.functions.{countDistinct}
val exprs = Seq("prop1", "prop2", "prop3")
.map(c => (countDistinct(c) > 1).cast("integer").alias(c))
val combined = result.reduce(_ unionAll _)
val aggregatedViaGroupBy = combined
.groupBy($"id")
.agg(exprs.head, exprs.tail: _*)
aggregatedViaGroupBy.show
// +---+-----+-----+-----+
// | id|prop1|prop2|prop3|
// +---+-----+-----+-----+
// | 0| 1| 1| 0|
// | 1| 1| 0| 0|
// | 2| 1| 1| 1|
// | 3| 1| 1| 1|
// | 4| 0| 0| 0|
// +---+-----+-----+-----+
First we need to join all the DataFrames together.
val combined = result.reduceLeft((a,b) => a.join(b,"id"))
To compare all the columns of the same label (e.g., "prod1"), I found it easier (at least for me) to operate on the RDD level. We fist transform the data into (id, Seq[Double]).
val finalResults = combined.rdd.map{
x =>
(x.getInt(0), x.toSeq.tail.map(_.asInstanceOf[Double]))
}.map{
case(i,d) =>
def checkAllEqual(l: Seq[Double]) = if(l.toSet.size == 1) 0 else 1
val g = d.grouped(3).toList
val g1 = checkAllEqual(g.map(x => x(0)))
val g2 = checkAllEqual(g.map(x => x(1)))
val g3 = checkAllEqual(g.map(x => x(2)))
(i, g1,g2,g3)
}.toDF("id", "prod1", "prod2", "prod3")
finalResults.show()
This will print:
+---+-----+-----+-----+
| id|prod1|prod2|prod3|
+---+-----+-----+-----+
| 0| 1| 1| 0|
| 1| 1| 0| 0|
| 2| 1| 1| 1|
| 3| 1| 1| 1|
| 4| 0| 0| 0|
+---+-----+-----+-----+