Apply the same common header to distinct fields of dataframes in Scala Spark

I want to apply the same common header to all the dataframes I generate. The application must know which columns to change, add, or remove, and in which position.
The dataframes come with different column orders, some columns missing and some columns added. What I want is:
If there are more columns than in the common header, the extra ones are removed.
If some columns are missing, add them and fill their rows with null values.
// df with common header to apply
val mainDF = Seq(("a","b","c","d","e")).toDF("first","second","third","fourth","fifth")
mainDF.show()
+-----+------+-----+------+-----+
|first|second|third|fourth|fifth|
+-----+------+-----+------+-----+
| a| b| c| d| e|
+-----+------+-----+------+-----+
// Case 1: distinct column order
val df1 = Seq(("a", "c","b","d","e")).toDF("first","third","second","fourth","fifth")
df1.show()
+-----+-----+------+------+-----+
|first|third|second|fourth|fifth|
+-----+-----+------+------+-----+
| a| c| b| d| e|
+-----+-----+------+------+-----+
// Result desired:
val df1_correct = Seq(("a","b","c","d","e")).toDF("first","second","third","fourth","fifth")
df1_correct.show()
+-----+------+-----+------+-----+
|first|second|third|fourth|fifth|
+-----+------+-----+------+-----+
| a| b| c| d| e|
+-----+------+-----+------+-----+
// Case 2: missing columns
val df2 = Seq(("a", "b","c","d")).toDF("first","second","third","fourth")
df2.show()
+-----+------+-----+------+
|first|second|third|fourth|
+-----+------+-----+------+
| a| b| c| d|
+-----+------+-----+------+
// Result desired:
val df2_correct = Seq(("a","b","c","d","null")).toDF("first","second","third","fourth","fifth")
df2_correct.show()
+-----+------+-----+------+-----+
|first|second|third|fourth|fifth|
+-----+------+-----+------+-----+
| a| b| c| d| null|
+-----+------+-----+------+-----+
// Case 3: columns added
val df3 = Seq(("a", "b","c","d","e","f")).toDF("first","second","third","fourth","fifth","sixth")
df3.show()
+-----+------+-----+------+-----+-----+
|first|second|third|fourth|fifth|sixth|
+-----+------+-----+------+-----+-----+
| a| b| c| d| e| f|
+-----+------+-----+------+-----+-----+
// Result desired:
val df3_correct = Seq(("a","b","c","d","e")).toDF("first","second","third","fourth","fifth")
df3_correct.show()
+-----+------+-----+------+-----+
|first|second|third|fourth|fifth|
+-----+------+-----+------+-----+
| a| b| c| d| e|
+-----+------+-----+------+-----+
// Case 4: different column order and a missing column
val df4 = Seq(("a", "c","b","d")).toDF("first","third","second","fourth")
df4.show()
+-----+-----+------+------+
|first|third|second|fourth|
+-----+-----+------+------+
| a| c| b| d|
+-----+-----+------+------+
// Result desired:
val df4_correct = Seq(("a","b","c","d","null")).toDF("first","second","third","fourth","fifth")
df4_correct.show()
+-----+------+-----+------+-----+
|first|second|third|fourth|fifth|
+-----+------+-----+------+-----+
| a| b| c| d| null|
+-----+------+-----+------+-----+

This should cover all your cases:
import org.apache.spark.sql.functions.{col, lit}

val selectExp = mainDF.columns.map(c =>
  if (df4.columns.contains(c)) col(c) else lit(null).as(c)
)
You map over mainDF.columns, which is an Array[String] of the column names of mainDF:
Array[String] = Array(first, second, third, fourth, fifth)
Replace df4 with whichever dataframe you want to generate the expression for. If a column of mainDF also exists in dfx, select it; otherwise generate a null with the column name taken from mainDF.
You will get an Array[org.apache.spark.sql.Column]:
Array[org.apache.spark.sql.Column] = Array(first, second, third, fourth, NULL AS `fifth`)
which you can use on the df as
df4.select(selectExp : _*).show
//+-----+------+-----+------+-----+
//|first|second|third|fourth|fifth|
//+-----+------+-----+------+-----+
//| a| b| c| d| null|
//+-----+------+-----+------+-----+
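If the same header has to be applied to many dataframes, the expression above can be wrapped in a small helper. A minimal sketch based on the answer's expression (conformToHeader is just an illustrative name):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// Keep the common header's columns in order: select matching columns from df,
// fill missing ones with null, and implicitly drop any extra columns.
def conformToHeader(df: DataFrame, header: Seq[String]): DataFrame = {
  val selectExp = header.map(c => if (df.columns.contains(c)) col(c) else lit(null).as(c))
  df.select(selectExp: _*)
}

// usage, e.g.:
// conformToHeader(df2, mainDF.columns)   // adds a null fifth column
// conformToHeader(df3, mainDF.columns)   // drops the sixth column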

Related

Collect most occurring unique values across columns after a groupby in Spark

I have the following dataframe
val input = Seq(("ZZ","a","a","b","b"),
("ZZ","a","b","c","d"),
("YY","b","e",null,"f"),
("YY","b","b",null,"f"),
("XX","j","i","h",null))
.toDF("main","value1","value2","value3","value4")
input.show()
+----+------+------+------+------+
|main|value1|value2|value3|value4|
+----+------+------+------+------+
| ZZ| a| a| b| b|
| ZZ| a| b| c| d|
| YY| b| e| null| f|
| YY| b| b| null| f|
| XX| j| i| h| null|
+----+------+------+------+------+
I need to group by the main column and pick the two most frequently occurring values from the remaining columns for each main value.
I did the following
val newdf = input.select('main,array('value1,'value2,'value3,'value4).alias("values"))
val newdf2 = newdf.groupBy('main).agg(collect_set('values).alias("values"))
val newdf3 = newdf2.select('main, flatten($"values").alias("values"))
to get the data into the following form:
+----+--------------------+
|main| values|
+----+--------------------+
| ZZ|[a, a, b, b, a, b...|
| YY|[b, e,, f, b, b,, f]|
| XX| [j, i, h,]|
+----+--------------------+
Now I need to pick the two most frequently occurring items from the list as two columns, but I don't know how to do that.
So, in this case the expected output should be
+----+------+------+
|main|value1|value2|
+----+------+------+
| ZZ| a| b|
| YY| b| f|
| XX| j| i|
+----+------+------+
Nulls should not be counted, and the final values should be null only if there are no other values to fill them with.
Is this the best way to do this? Is there a better way of doing it?
You can use a UDF to select the two values from the array that occur most often.
input.withColumn("values", array("value1", "value2", "value3", "value4"))
.groupBy("main").agg(flatten(collect_list("values")).as("values"))
.withColumn("max", maxUdf('values)) //(1)
.cache() //(2)
.withColumn("value1", 'max.getItem(0))
.withColumn("value2", 'max.getItem(1))
.drop("values", "max")
.show(false)
with maxUdf being defined as
def getMax[T](array: Seq[T]) = {
  array
    .filter(_ != null)                     // remove null values
    .groupBy(identity).mapValues(_.length) // count occurrences of each value
    .toSeq.sortWith(_._2 > _._2)           // sort by count, descending (3)
    .map(_._1).take(2)                     // return the two (or one) most common values
}
val maxUdf = udf(getMax[String] _)
Remarks:
using a UDF here means that the whole array, with all entries for a single value of main, has to fit into the memory of one Spark executor
cache is required here, or the UDF will be called twice, once for value1 and once for value2
the sortWith here is stable, but it might be necessary to add some extra logic to handle the situation where two elements have the same number of occurrences (like i, j and h for the main value XX); see the sketch below
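For instance, a deterministic tie-break could sort by count and then by the value itself. A minimal sketch of that extra logic (the secondary alphabetical ordering is an arbitrary choice, not something from the original answer):

import org.apache.spark.sql.functions.udf

// Like getMax, but ties on the count are broken alphabetically by value,
// so repeated runs return the same two elements.
def getMaxDeterministic(array: Seq[String]): Seq[String] = {
  array
    .filter(_ != null)
    .groupBy(identity).mapValues(_.length)
    .toSeq
    .sortBy { case (value, count) => (-count, value) } // count descending, then value ascending
    .map(_._1)
    .take(2)
}
val maxDetUdf = udf(getMaxDeterministic _)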
Here is my attempt without a UDF.
import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy('main).orderBy('count.desc)
newdf3.withColumn("values", explode('values))
  .groupBy('main, 'values).agg(count('values).as("count"))
  .filter("values is not null")
  .withColumn("target", concat(lit("value"), lit(row_number().over(w)))) // value1, value2, ...
  .filter("target < 'value3'") // keep only the two most frequent values per main (string comparison)
  .groupBy('main).pivot('target).agg(first('values)).show
+----+------+------+
|main|value1|value2|
+----+------+------+
| ZZ| a| b|
| YY| b| f|
| XX| j| null|
+----+------+------+
The last row has the null value because I have modified your dataframe in this way,
+----+--------------------+
|main| values|
+----+--------------------+
| ZZ|[a, a, b, b, a, b...|
| YY|[b, e,, f, b, b,, f]|
| XX| [j,,,]| <- For null test
+----+--------------------+
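A small variant of the same approach (a sketch, not the original answer) keeps the rank as its own helper column, so the "top two" filter does not rely on comparing the generated column names as strings:

// Same idea: rank the values per main by count, keep ranks 1 and 2, then pivot.
newdf3.withColumn("values", explode('values))
  .filter("values is not null")
  .groupBy('main, 'values).agg(count('values).as("count"))
  .withColumn("rank", row_number().over(w))
  .filter('rank <= 2)
  .withColumn("target", concat(lit("value"), 'rank))
  .groupBy('main).pivot('target).agg(first('values))
  .show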

spark scala: remove consecutive (by date) duplicate records from a dataframe

The question is about working with dataframes: I want to delete records that are complete duplicates, excluding some fields (dates).
I tried to use a windowFunction (WindowSpec) as:
val wFromDupl: WindowSpec = Window
.partitionBy(comparateFields: _*)
.orderBy(asc(orderField))
In the comparateFields variable I store all the fields that I have to check (in the example, DESC1 and DESC2) to eliminate duplicates, following the logic that, if there is a duplicate record, we discard the ones with the higher date.
In the orderField variable, I simply store the effective_date field.
Therefore, by applying the window function, I calculate a temporary column, assigning the smallest date to all the records that are duplicates, and then filter the dataframe as:
val dfFinal: DataFrame = dfInicial
.withColumn("w_eff_date", min(col("effective_date")).over(wFromDupl))
.filter(col("effective_date") === col("w_eff_date"))
.drop("w_eff_date")
.distinct()
.withColumn("effective_end_date", lead(orderField, 1, "9999-12-31").over(w))
For the following case it works correctly:
KEY EFFECTIVE_DATE DESC1 DESC2 W_EFF_DATE (tmp)
E2 2000 A B 2000
E2 2001 A B 2000
E2 2002 AA B 2002
The code will drop the second record:
E2 2001 A B 2000
But the logic must be applied to CONSECUTIVE records (by date). For example, in the following case, as the code is implemented, we delete the third record (DESC1 and DESC2 are the same, and the min eff date is 2000), but we don't want this, because there is a record in between (2001 AA B) by eff_date, so we want to keep all 3 records:
KEY EFFECTIVE_DATE DESC1 DESC2 W_EFF_DATE (tmp)
E1 2000 A B 2000
E1 2001 AA B 2001
E1 2002 A B 2000
Any advice on this?
Thank you all!
One approach would be to use when/otherwise along with Window function lag to determine which rows to keep, as shown below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
("E1", "2000", "A", "B"),
("E1", "2001", "AA", "B"),
("E1", "2002", "A", "B"),
("E1", "2003", "A", "B"),
("E1", "2004", "A", "B"),
("E2", "2000", "C", "D"),
("E2", "2001", "C", "D"),
("E2", "2002", "CC", "D"),
("E2", "2003", "C", "D")
).toDF("key", "effective_date", "desc1", "desc2")
val compareCols = List("desc1", "desc2")
val win1 = Window.partitionBy("key").orderBy("effective_date")
val df2 = df.
withColumn("compCols", struct(compareCols.map(col): _*)).
withColumn("rowNum", row_number.over(win1)).
withColumn("toKeep",
when($"rowNum" === 1 || $"compCols" =!= lag($"compCols", 1).over(win1), true).
otherwise(false)
)
// +---+--------------+-----+-----+--------+------+------+
// |key|effective_date|desc1|desc2|compCols|rowNum|toKeep|
// +---+--------------+-----+-----+--------+------+------+
// | E1| 2000| A| B| [A,B]| 1| true|
// | E1| 2001| AA| B| [AA,B]| 2| true|
// | E1| 2002| A| B| [A,B]| 3| true|
// | E1| 2003| A| B| [A,B]| 4| false|
// | E1| 2004| A| B| [A,B]| 5| false|
// | E2| 2000| C| D| [C,D]| 1| true|
// | E2| 2001| C| D| [C,D]| 2| false|
// | E2| 2002| CC| D| [CC,D]| 3| true|
// | E2| 2003| C| D| [C,D]| 4| true|
// +---+--------------+-----+-----+--------+------+------+
df2.where($"toKeep").select(df.columns.map(col): _*).
show
// +---+--------------+-----+-----+
// |key|effective_date|desc1|desc2|
// +---+--------------+-----+-----+
// | E1| 2000| A| B|
// | E1| 2001| AA| B|
// | E1| 2002| A| B|
// | E2| 2000| C| D|
// | E2| 2002| CC| D|
// | E2| 2003| C| D|
// +---+--------------+-----+-----+
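The same logic can also be written more compactly by computing the keep flag in one expression and dropping it right away. A sketch (dfDedup and compStruct are just illustrative names):

// Keep a row when it is the first for its key, or when its compare-columns
// differ from those of the previous row (ordered by effective_date).
val compStruct = struct(compareCols.map(col): _*)
val dfDedup = df.
  withColumn("toKeep", row_number.over(win1) === 1 || compStruct =!= lag(compStruct, 1).over(win1)).
  where($"toKeep").
  drop("toKeep")
dfDedup.show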

Converting rows into columns using Spark Scala

I want to convert rows into columns using a Spark dataframe.
My table is like this:
Eno,Name
1,A
1,B
1,C
2,D
2,E
I want to convert it into
Eno,n1,n2,n3
1,A,B,C
2,D,E,Null
I used the code below:
val r = spark.sqlContext.read.format("csv").option("header","true").option("inferSchema","true").load("C:\\Users\\axy\\Desktop\\abc2.csv")
val n = Seq("n1","n2","n3")
r.groupBy("Eno")
 .pivot("Name", n)
 .agg(expr("coalesce(first(Name),3)").cast("double"))
 .show()
But I am getting this result:
+---+----+----+----+
|Eno| n1| n2| n3|
+---+----+----+----+
| 1|null|null|null|
| 2|null|null|null|
+---+----+----+----+
Can anyone help me get the desired result?
val m= map(lit("A"), lit("n1"), lit("B"),lit("n2"), lit("C"), lit("n3"), lit("D"), lit("n1"), lit("E"), lit("n2"))
val df= Seq((1,"A"),(1,"B"),(1,"C"),(2,"D"),(2,"E")).toDF("Eno","Name")
df.withColumn("new", m($"Name")).groupBy("Eno").pivot("new").agg(first("Name"))
+---+---+---+----+
|Eno| n1| n2| n3|
+---+---+---+----+
| 1| A| B| C|
| 2| D| E|null|
+---+---+---+----+
import org.apache.spark.sql.functions._
import spark.implicits._
val df= Seq((1,"A"),(1,"B"),(1,"C"),(2,"D"),(2,"E")).toDF("Eno","Name")
val getName=udf {(names: Seq[String],i : Int) => if (names.size>i) names(i) else null}
val tdf=df.groupBy($"Eno").agg(collect_list($"name").as("names"))
val ndf=(0 to 2).foldLeft(tdf){(ndf,i) => ndf.withColumn(s"n${i}",getName($"names",lit(i))) }.
drop("names")
ndf.show()
+---+---+---+----+
|Eno| n1| n2| n3|
+---+---+---+----+
| 1| A| B| C|
| 2| D| E|null|
+---+---+---+----+
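As a side note, the UDF can be avoided here: getItem on an array column returns null for an out-of-range index (with ANSI mode off), so the n1..n3 columns can be built directly from the collected list. A minimal sketch reusing tdf from above; note that collect_list gives no ordering guarantee:

// Build n1..n3 straight from the collected array; missing positions come back as null.
val nameCols = (0 until 3).map(i => $"names".getItem(i).as(s"n${i + 1}"))
tdf.select(($"Eno" +: nameCols): _*).show()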

scala: method that returns varargs

From a Scala method, I want to return a variable number of Spark columns, like this:
def getColumns() : (Column*) = {...}
The idea is then to use it with Spark SQL:
myDf.select(getColumns, "anotherColumns"..)
The thing is, I have about 30 queries that all share the same select clause, which I want to factor out.
Any idea what to put in place of the ...? I tried something like:
($"col1", "$col2")
but it doesn't compile.
Try this:
val df = Seq((1,2,3,4),(5,6,7,8)).toDF("a","b","c","d")
Convert the column-name strings to Spark columns with map, and append additional columns to the sequence as required.
val lstCols = List("a","b")
df.select(lstCols.map(col) ++ List(col("c"),col("d")): _*).show()
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 6| 7| 8|
+---+---+---+---+
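To answer the original question more directly: a method cannot declare Column* as its return type, but it can return a Seq[Column] that the caller splices into select with : _*. A minimal sketch against the same df (the shared columns a and b are just an example):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// The shared part of the select clause, returned as a sequence of columns.
def getColumns(): Seq[Column] = Seq(col("a"), col("b"))

// Splice the shared columns in and append the request-specific ones.
df.select(getColumns() :+ col("c") :+ col("d"): _*).show()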

Spark join produces wrong results

Presenting here before possibly filing a bug. I'm using Spark 1.6.0.
This is a simplified version of the problem I'm dealing with. I've filtered a table, and then I'm trying to do a left outer join with that subset and the main table, matching all the columns.
I've only got 2 rows in the main table and one in the filtered table. I'm expecting the resulting table to only have the single row from the subset.
scala> val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
b: org.apache.spark.sql.DataFrame = [a: string, b: string, c: int]
scala> val a = b.where("c = 1").withColumnRenamed("a", "filta").withColumnRenamed("b", "filtb")
a: org.apache.spark.sql.DataFrame = [filta: string, filtb: string, c: int]
scala> a.join(b, $"filta" <=> $"a" and $"filtb" <=> $"b" and a("c") <=> b("c"), "left_outer").show
+-----+-----+---+---+---+---+
|filta|filtb| c| a| b| c|
+-----+-----+---+---+---+---+
| a| b| 1| a| b| 1|
| a| b| 1| a| b| 2|
+-----+-----+---+---+---+---+
I didn't expect that result at all. I expected the first row, but not the second. I suspected it was the null-safe equality, so I tried it without.
scala> a.join(b, $"filta" === $"a" and $"filtb" === $"b" and a("c") === b("c"), "left_outer").show
16/03/21 12:50:00 WARN Column: Constructing trivially true equals predicate, 'c#18232 = c#18232'. Perhaps you need to use aliases.
+-----+-----+---+---+---+---+
|filta|filtb| c| a| b| c|
+-----+-----+---+---+---+---+
| a| b| 1| a| b| 1|
+-----+-----+---+---+---+---+
OK, that's the result I expected, but then I got suspicious of the warning. There is a separate StackOverflow question that deals with that warning here: Spark SQL performing carthesian join instead of inner join
So I create a new column that avoids the warning.
scala> a.withColumn("newc", $"c").join(b, $"filta" === $"a" and $"filtb" === $"b" and $"newc" === b("c"), "left_outer").show
+-----+-----+---+----+---+---+---+
|filta|filtb| c|newc| a| b| c|
+-----+-----+---+----+---+---+---+
| a| b| 1| 1| a| b| 1|
| a| b| 1| 1| a| b| 2|
+-----+-----+---+----+---+---+---+
But now the result is wrong again!
I have a lot of null-safe equality checks, and the warning isn't fatal, so I don't see a clear path to working with/around this.
Is the behaviour a bug, or is this expected behaviour? If expected, why?
If you want the expected behavior, use either a join on names:
val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
val a = b.where("c = 1")
a.join(b, Seq("a", "b", "c")).show
// +---+---+---+
// | a| b| c|
// +---+---+---+
// | a| b| 1|
// +---+---+---+
or aliases:
val aa = a.alias("a")
val bb = b.alias("b")
aa.join(bb, $"a.a" === $"b.a" && $"a.b" === $"b.b" && $"a.c" === $"b.c")
You can use <=> as well:
aa.join(bb, $"a.a" <=> $"b.a" && $"a.b" <=> $"b.b" && $"a.c" <=> $"b.c")
As far as I remember there's been a special case for simple equality for a while. That's why you get correct results despite the warning.
The second behavior does indeed look like a bug, related to the fact that you still have a.c in your data. It looks like it is picked up downstream before b.c, and the condition actually evaluated is a.newc = a.c.
val expr = $"filta" === $"a" and $"filtb" === $"b" and $"newc" === $"c"
a.withColumnRenamed("c", "newc").join(b, expr, "left_outer")