I use Spark 2.0.1 and Scala 2.11; this question is related to this one.
Below is the setup:
val ss = new StructType().add("x", IntegerType).add("y", MapType(DoubleType, IntegerType))
val s = new StructType()
.add("a", IntegerType)
.add("b", ss)
val d = Seq(Row(1, Row(1,Map(1.0->1, 2.0->2))),
Row(2, Row(2,Map(2.0->2, 3.0->3))),
Row(3, null ),
Row(4, Row(4, Map())))
val rd = sc.parallelize(d)
val df = spark.createDataFrame(rd, s)
df.select($"a", $"b").show(false)
// +---+---------------------------+
// |a |b |
// +---+---------------------------+
// |1 |[1,Map(1.0 -> 1, 2.0 -> 2)]|
// |2 |[2,Map(2.0 -> 2, 3.0 -> 3)]|
// |3 |null |
// |4 |[4,Map()] |
// +---+---------------------------+
//
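For reference, this is the nested schema that any default passed to coalesce has to match exactly (printSchema output for the setup above):
df.printSchema()
// root
//  |-- a: integer (nullable = true)
//  |-- b: struct (nullable = true)
//  |    |-- x: integer (nullable = true)
//  |    |-- y: map (nullable = true)
//  |    |    |-- key: double
//  |    |    |-- value: integer (valueContainsNull = true)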
The statement below works when I provide a default to coalesce (the cell at row a=3, pivot column 3 holds the default value):
df.groupBy($"a").pivot("a").
agg(expr("first(coalesce(b, named_struct('x', cast(null as Int), 'y', Map(0.0D, 0) )))" ) )
.show(false)
// +---+---------------------------+---------------------------+--------------------+---------+
// |a |1 |2 |3 |4 |
// +---+---------------------------+---------------------------+--------------------+---------+
// |1 |[1,Map(1.0 -> 1, 2.0 -> 2)]|null |null |null |
// |3 |null |null |[null,Map(0.0 -> 0)]|null |
// |4 |null |null |null |[4,Map()]|
// |2 |null |[2,Map(2.0 -> 2, 3.0 -> 3)]|null |null |
// +---+---------------------------+---------------------------+--------------------+---------+
But how do I create an empty Map() (like the one seen for a=4) using named_struct or otherwise?
You can achieve this with a case class and a UDF:
case class MyStruct(x: Option[Int], y: Map[Double, Int])
import org.apache.spark.sql.functions.{coalesce, first, udf}
val emptyStruct = udf(() => MyStruct(None, Map.empty[Double, Int]))
df.groupBy($"a").pivot("a")
.agg(first(coalesce($"b",emptyStruct())))
.show(false)
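As an aside: if you can move to Spark 2.2 or later, typedLit can build the typed empty map without a UDF. A sketch (not applicable to 2.0.1, where typedLit does not exist):
import org.apache.spark.sql.functions.{coalesce, first, lit, struct, typedLit}
// typedLit preserves the Map[Double, Int] element types, which plain lit cannot
val defaultB = struct(
  lit(null).cast(IntegerType).as("x"),
  typedLit(Map.empty[Double, Int]).as("y")
)
df.groupBy($"a").pivot("a")
  .agg(first(coalesce($"b", defaultB)))
  .show(false)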
I have a dataframe df with columns a, b, c, d, e, f, g.
I have a Scala list L1, which is List[Any] = List(a, b, c).
How do I perform a groupBy on the DF using L1 and find duplicates, if any?
Also, how do I find out whether the dataframe has nulls/blanks/empty values in the columns mentioned in L1?
e.g. df.groupBy(l1) needs to work, as l1 may vary from time to time.
// Null
case class Source(
a: Option[String],
b: Option[String],
c: Option[String],
d: Option[String],
e: Option[String],
f: Option[String],
g: Option[String] )
val l = List("a", "b", "c")
import spark.implicits._ // needed for toDF below
val sourceDF = Seq(
Source(None, Some("b1"), Some("c1"), Some("d1"), Some("e1"), Some("f1"), Some("g1")),
Source(Some("a2"), None, Some("c2"), Some("d2"), Some("e2"), Some("f2"), Some("g2")),
Source(Some("a3"), Some("b3"), None, Some("d3"), Some("e3"), Some("f3"), Some("g3")),
Source(Some("a4"), Some("b4"), Some("c4"), Some("d4"), Some("e4"), Some("f4"), Some("g4"))
).toDF()
sourceDF.show(false)
// +----+----+----+---+---+---+---+
// |a |b |c |d |e |f |g |
// +----+----+----+---+---+---+---+
// |null|b1 |c1 |d1 |e1 |f1 |g1 |
// |a2 |null|c2 |d2 |e2 |f2 |g2 |
// |a3 |b3 |null|d3 |e3 |f3 |g3 |
// |a4 |b4 |c4 |d4 |e4 |f4 |g4 |
// +----+----+----+---+---+---+---+
val f1 = l.map(i => s" $i is null").mkString(" or ")
sourceDF.where(f1).show(false)
// +----+----+----+---+---+---+---+
// |a |b |c |d |e |f |g |
// +----+----+----+---+---+---+---+
// |null|b1 |c1 |d1 |e1 |f1 |g1 |
// |a2 |null|c2 |d2 |e2 |f2 |g2 |
// |a3 |b3 |null|d3 |e3 |f3 |g3 |
// +----+----+----+---+---+---+---+
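The question also asks about blanks and empty values; the same string-predicate pattern extends to cover those (a sketch, assuming the listed columns are strings):
val f2 = l.map(i => s"($i is null or trim($i) = '')").mkString(" or ")
sourceDF.where(f2).show(false)
// same three rows here, since this data has nulls but no blank strings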
// groupBy
val gbDF = sourceDF.groupBy(l.head, l.tail:_*).count()
gbDF.show(false)
// +----+----+----+-----+
// |a |b |c |count|
// +----+----+----+-----+
// |a2 |null|c2 |1 |
// |a4 |b4 |c4 |1 |
// |a3 |b3 |null|1 |
// |null|b1 |c1 |1 |
// +----+----+----+-----+
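To answer the duplicate part: keep only the groups whose count exceeds 1.
val dupDF = sourceDF.groupBy(l.head, l.tail: _*).count().filter($"count" > 1)
dupDF.show(false)
// empty for this data -- every (a, b, c) combination occurs exactly once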
Given a DF, let's say I have 3 classes, each with a method addCol that uses the columns in the DF to create and append a new column (based on a different calculation).
What is the best way to get a resulting DF that contains the original df plus the 3 added columns?
val df = Seq((1, 2), (2, 5), (3, 7)).toDF("num1", "num2")
// in the first class:
def addCol(df: DataFrame): DataFrame = {
  df.withColumn("method1", col("num1") / col("num2"))
}
// in the second class:
def addCol(df: DataFrame): DataFrame = {
  df.withColumn("method2", col("num1") * col("num2"))
}
// in the third class:
def addCol(df: DataFrame): DataFrame = {
  df.withColumn("method3", col("num1") + col("num2"))
}
One option is actions.foldLeft(df) { (df, action) => action.addCol(df) }. The end result is the DF I want, with columns num1, num2, method1, method2, and method3. But as I understand it, this will not make use of distributed evaluation, and each addCol will happen sequentially. What is a more efficient way to do this?
An efficient way to do this is to use select.
select is faster than foldLeft when you have very large data; check this post.
You can build the required expressions and use them inside select; check the code below.
scala> df.show(false)
+----+----+
|num1|num2|
+----+----+
|1 |2 |
|2 |5 |
|3 |7 |
+----+----+
scala> val colExpr = Seq(
$"num1",
$"num2",
($"num1"/$"num2").as("method1"),
($"num1" * $"num2").as("method2"),
($"num1" + $"num2").as("method3")
)
Final Output
scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1 |method2|method3|
+----+----+-------------------+-------+-------+
|1 |2 |0.5 |2 |3 |
|2 |5 |0.4 |10 |7 |
|3 |7 |0.42857142857142855|21 |10 |
+----+----+-------------------+-------+-------+
Update
Return a Column instead of a DataFrame. Using a higher-order function, all three of your functions can be replaced with the single function below.
scala> def add(
  num1: Column, // you could also use variable args here if you want
  num2: Column,
  f: (Column, Column) => Column
): Column = f(num1, num2)
For example, with varargs; when invoking this version you pass the function first and the required columns at the end.
def add(f: (Column,Column) => Column,cols:Column*): Column = cols.reduce(f)
Invoking the two-column add function (note: in the REPL, define only the version you intend to use, since the later definition of add shadows the earlier one):
scala> val colExpr = Seq(
$"num1",
$"num2",
add($"num1",$"num2",(_ / _)).as("method1"),
add($"num1", $"num2",(_ * _)).as("method2"),
add($"num1", $"num2",(_ + _)).as("method3")
)
Final Output
scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1 |method2|method3|
+----+----+-------------------+-------+-------+
|1 |2 |0.5 |2 |3 |
|2 |5 |0.4 |10 |7 |
|3 |7 |0.42857142857142855|21 |10 |
+----+----+-------------------+-------+-------+
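For completeness, a sketch of invoking the varargs variant instead, where the function comes first and the columns follow:
scala> val colExpr2 = Seq(
  $"num1",
  $"num2",
  add(_ / _, $"num1", $"num2").as("method1"),
  add(_ * _, $"num1", $"num2").as("method2"),
  add(_ + _, $"num1", $"num2").as("method3")
)
scala> df.select(colExpr2:_*).show(false) // same output as above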
I have two dataframes, left and right. I have a working solution for my question, but I need a way to make it generic. My question is at the end.
leftDF:
+------+---------+-------+-------+
|leftId|leftAltId|leftCur|leftAmt|
+------+---------+-------+-------+
|1 |100 |USD |20 |
|2 |200 |INR |100 |
|4 |500 |MXN |100 |
+------+---------+-------+-------+
rightDF:
+-------+----------+--------+--------+
|rightId|rightAltId|rightCur|rightAmt|
+-------+----------+--------+--------+
|1 |300 |USD |20 |
|3 |400 |MXN |100 |
|4 |600 |MXN |200 |
+-------+----------+--------+--------+
I want to perform a join between these two dataframes, and I expect four dataframes as output:
transactions that exist in leftDF and not in rightDF
transactions that exist in rightDF and not in leftDF
transactions that have at least one of the ids common between the two dataframes
3.a Strict match: same currency and amount in both dataframes. Example: transaction with id 1.
3.b Relaxed match: transactions that have the same id but a different currency/amount combination. Example: transaction with id 4.
Here's the desired output:
transactions that exist in leftDF and not in rightDF
+------+---------+-------+-------+-------+----------+--------+--------+
|leftId|leftAltId|leftCur|leftAmt|rightId|rightAltId|rightCur|rightAmt|
+------+---------+-------+-------+-------+----------+--------+--------+
|2 |200 |INR |100 |null |null |null |null |
+------+---------+-------+-------+-------+----------+--------+--------+
transactions that exist in rightDF and not in leftDF
+------+---------+-------+-------+-------+----------+--------+--------+
|leftId|leftAltId|leftCur|leftAmt|rightId|rightAltId|rightCur|rightAmt|
+------+---------+-------+-------+-------+----------+--------+--------+
|null |null |null |null |3 |400 |MXN |100 |
+------+---------+-------+-------+-------+----------+--------+--------+
transactions that have at least one of the ids common between the two dataframes
+------+---------+-------+-------+-------+----------+--------+--------+
|leftId|leftAltId|leftCur|leftAmt|rightId|rightAltId|rightCur|rightAmt|
+------+---------+-------+-------+-------+----------+--------+--------+
|1 |100 |USD |20 |1 |300 |USD |20 |
|4 |500 |MXN |100 |4 |600 |MXN |200 |
+------+---------+-------+-------+-------+----------+--------+--------+
3.a Strict match: same currency and amount in both dataframes. Example: transaction with id 1.
+------+---------+-------+-------+-------+----------+--------+--------+
|leftId|leftAltId|leftCur|leftAmt|rightId|rightAltId|rightCur|rightAmt|
+------+---------+-------+-------+-------+----------+--------+--------+
|1 |100 |USD |20 |1 |300 |USD |20 |
+------+---------+-------+-------+-------+----------+--------+--------+
3.b Relaxed match: transactions that have the same id but a different currency/amount combination. Example: transaction with id 4.
+------+---------+-------+-------+-------+----------+--------+--------+
|leftId|leftAltId|leftCur|leftAmt|rightId|rightAltId|rightCur|rightAmt|
+------+---------+-------+-------+-------+----------+--------+--------+
|4 |500 |MXN |100 |4 |600 |MXN |200 |
+------+---------+-------+-------+-------+----------+--------+--------+
Here's the working code I have for it:
import sparkSession.implicits._
val leftDF: DataFrame = Seq((1, 100, "USD", 20), (2, 200, "INR", 100), (4, 500, "MXN", 100)).toDF("leftId", "leftAltId", "leftCur", "leftAmt")
val rightDF: DataFrame = Seq((1, 300, "USD", 20), (3, 400, "MXN", 100), (4, 600, "MXN", 200)).toDF("rightId", "rightAltId", "rightCur", "rightAmt")
leftDF.show(false)
rightDF.show(false)
val idMatchQuery = leftDF("leftId") === rightDF("rightId") || leftDF("leftAltId") === rightDF("rightAltId")
val currencyMatchQuery = leftDF("leftCur") === rightDF("rightCur") && leftDF("leftAmt") === rightDF("rightAmt")
val leftOnlyQuery = (col("leftId").isNotNull && col("rightId").isNull) || (col("leftAltId").isNotNull && col("rightAltId").isNull)
val rightOnlyQuery = (col("rightId").isNotNull && col("leftId").isNull) || (col("rightAltId").isNotNull && col("leftAltId").isNull)
val matchQuery = (col("rightId").isNotNull && col("leftId").isNotNull) || (col("rightAltId").isNotNull && col("leftAltId").isNotNull)
val result = leftDF.join(rightDF, idMatchQuery, "fullouter")
val leftOnlyDF = result.filter(leftOnlyQuery)
val rightOnlyDF = result.filter(rightOnlyQuery)
val matchDF = result.filter(matchQuery)
val strictMatchDF = matchDF.filter(currencyMatchQuery.equalTo(true))
val relaxedMatchDF = matchDF.filter(currencyMatchQuery.equalTo(false))
leftOnlyDF.show(false)
rightOnlyDF.show(false)
matchDF.show(false)
strictMatchDF.show(false)
relaxedMatchDF.show(false)
What I'm looking for:
I want to be able to take the column names to join on, as a list and make the code generic.
For example:
val relaxedJoinList = Array(("leftId", "rightId"), ("leftAltId", "rightAltId"))
val strictJoinList = Array(("leftCur", "rightCur"), ("leftAmt", "rightAmt"))
I want to be able to take the column names to join on, as a list and make the code generic.
This is not a perfect suggestion, but it should definitely help you generalize. The suggestion is to go with foldLeft:
val relaxedJoinList = Array(("leftId", "rightId"), ("leftAltId", "rightAltId"))
val rHead = relaxedJoinList.head
val strictJoinList = Array(("leftCur", "rightCur"), ("leftAmt", "rightAmt"))
val sHead = strictJoinList.head
val idMatchQuery = relaxedJoinList.tail.foldLeft(leftDF(rHead._1) === rightDF(rHead._2)){(x, y) => x || leftDF(y._1) === rightDF(y._2)}
val currencyMatchQuery = strictJoinList.tail.foldLeft(leftDF(sHead._1) === rightDF(sHead._2)){(x, y) => x && leftDF(y._1) === rightDF(y._2)}
val leftOnlyQuery = relaxedJoinList.tail.foldLeft(col(rHead._1).isNotNull && col(rHead._2).isNull){(x, y) => x || col(y._1).isNotNull && col(y._2).isNull}
val rightOnlyQuery = relaxedJoinList.tail.foldLeft(col(rHead._1).isNull && col(rHead._2).isNotNull){(x, y) => x || col(y._1).isNull && col(y._2).isNotNull}
val matchQuery = relaxedJoinList.tail.foldLeft(col(rHead._1).isNotNull && col(rHead._2).isNotNull){(x, y) => x || col(y._1).isNotNull && col(y._2).isNotNull}
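If you want to avoid the separate head/tail handling, the folds can be expressed through one helper (a sketch; reduce assumes a non-empty pair list):
import org.apache.spark.sql.Column
def buildQuery(pairs: Seq[(String, String)],
               cond: (String, String) => Column,
               combine: (Column, Column) => Column): Column =
  pairs.map(cond.tupled).reduce(combine)
// equivalent to the first two folds above:
val idMatchQuery2 = buildQuery(relaxedJoinList, (l, r) => leftDF(l) === rightDF(r), _ || _)
val currencyMatchQuery2 = buildQuery(strictJoinList, (l, r) => leftDF(l) === rightDF(r), _ && _)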
The rest of the code is the same as yours.
I hope the answer is helpful.
I have a dataframe like
x   y
-   --
1   10
2   30
3   50
4   24
5   36
6   45
I want to append another column z whose value depends on the value of y.
So I created a function:
def GiveNumVal(col: Column) => Integer = {
if(Column>=0 && Column<15){
return 1;
}
else if(Column>=15 && Column<30){
return 2;
}
else if(Column>=30 && Column<45){
return 3;
}
else if (Column>=45 && Column<=59){
return 4;
}
else{
return 0;
}
}
And I call it with:
val new_df=df.withColumn("z",GiveNumVal($"y"));
It can't even compile, and I am not sure which part is wrong. Any help is appreciated.
You need to register the function as a UDF, or create a udf like this one:
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF below
// create dataframe
val df = Seq(
(1, 10),
(2, 30),
(3, 50),
(4, 24),
(5, 36),
(6, 45)
).toDF("x", "y")
//create udf
def giveNumVal = udf((c : Int) => {
if(c >=0 && c <15) 1
else if(c >=15 && c <30) 2
else if(c >=30 && c <45) 3
else if (c >=45 && c <=59) 4
else 0
})
And use it as
val new_df = df.withColumn("z", giveNumVal($"y"))
If you have a general function and want to use it as a UDF, you can register it like this:
//general function
def giveNumVal = (c : Int) => {
//implementation here
}
//To register
val GiveNumVal = spark.sqlContext.udf.register("functionName", giveNumVal)
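Once registered, the function can be applied directly as a column expression or called by name from SQL (a sketch; "functionName" is just the placeholder name used above):
// apply the returned UserDefinedFunction directly:
val withZ = df.withColumn("z", GiveNumVal($"y"))
// or call it by its registered name from SQL:
df.createOrReplaceTempView("t")
spark.sql("select x, y, functionName(y) as z from t").show(false)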
Output :
+---+---+---+
|x |y |z |
+---+---+---+
|1 |10 |1 |
|2 |30 |3 |
|3 |50 |4 |
|4 |24 |2 |
|5 |36 |3 |
|6 |45 |4 |
+---+---+---+
Note: you don't need return statements or semicolons in Scala.
Hope this helps!
You should use the built-in when function inside your GiveNumVal function, because if/else conditions won't work on columns.
import org.apache.spark.sql.functions._
def GiveNumVal(col: Column): Column =
  when(col >= 0 && col < 15, 1)
    .when(col >= 15 && col < 30, 2)
    .when(col >= 30 && col < 45, 3)
    .when(col >= 45 && col <= 59, 4)
    .otherwise(0)
val new_df = df.withColumn("z", GiveNumVal($"y"))
new_df.show(false)
which should give you
+---+---+---+
|x |y |z |
+---+---+---+
|1 |10 |1 |
|2 |30 |3 |
|3 |50 |4 |
|4 |24 |2 |
|5 |36 |3 |
|6 |45 |4 |
+---+---+---+
Note that this GiveNumVal function returns a Column and not an Integer.
I hope the answer is helpful.
I want to update an old dataframe based on the pos_id and article_id fields.
If the tuple (pos_id, article_id) exists, I add each column's value to the old row; if it doesn't exist, I insert the new row. That worked fine. But I don't know how to handle the case where the old dataframe is initially empty; in that case I should simply add the rows of the second dataframe to the old one. Here is what I did:
import org.apache.spark.sql.types.{DateType, DoubleType, LongType}

val histocaisse = spark.read
  .format("csv")
  .option("header", "true") // reading the headers
  .load("C:/Users/MHT/Desktop/histocaisse_dte1.csv")

val hist1 = histocaisse
  .withColumn("pos_id", 'pos_id.cast(LongType))
  .withColumn("article_id", 'article_id.cast(LongType)) // was 'pos_id in the original -- a typo
  .withColumn("date", 'date.cast(DateType))
  .withColumn("qte", 'qte.cast(DoubleType))
  .withColumn("ca", 'ca.cast(DoubleType))
val histocaisse2 = spark.read
  .format("csv")
  .option("header", "true") // reading the headers
  .load("C:/Users/MHT/Desktop/histocaisse_dte2.csv")

val hist2 = histocaisse2
  .withColumn("pos_id", 'pos_id.cast(LongType))
  .withColumn("article_id", 'article_id.cast(LongType)) // same typo fixed here
  .withColumn("date", 'date.cast(DateType))
  .withColumn("qte", 'qte.cast(DoubleType))
  .withColumn("ca", 'ca.cast(DoubleType))

hist1.show(false)
hist2.show(false)
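An aside before the outputs: the repeated withColumn casts can be collapsed into a single fold (a sketch):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
// (column name, target type) pairs driving the casts
val casts = Seq("pos_id" -> LongType, "article_id" -> LongType,
  "date" -> DateType, "qte" -> DoubleType, "ca" -> DoubleType)
def castAll(df: DataFrame): DataFrame =
  casts.foldLeft(df) { case (d, (name, t)) => d.withColumn(name, col(name).cast(t)) }
// castAll(histocaisse) and castAll(histocaisse2) reproduce hist1 and hist2 above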
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-07|2.5 |3.5 |
|2 |2 |2000-01-07|14.7|12.0|
|3 |3 |2000-01-07|3.5 |1.2 |
+------+----------+----------+----+----+
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-08|2.5 |3.5 |
|2 |2 |2000-01-08|14.7|12.0|
|3 |3 |2000-01-08|3.5 |1.2 |
|4 |4 |2000-01-08|3.5 |1.2 |
|5 |5 |2000-01-08|14.5|1.2 |
|6 |6 |2000-01-08|2.0 |1.25|
+------+----------+----------+----+----+
The desired merged output, with qte and ca summed per (pos_id, article_id):
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-08|5.0 |7.0 |
|2 |2 |2000-01-08|39.4|24.0|
|3 |3 |2000-01-08|7.0 |2.4 |
|4 |4 |2000-01-08|3.5 |1.2 |
|5 |5 |2000-01-08|14.5|1.2 |
|6 |6 |2000-01-08|2.0 |1.25|
+------+----------+----------+----+----+
Here is the solution I found:
val df = hist2.join(hist1, Seq("article_id", "pos_id"), "left")
.select($"pos_id", $"article_id",
coalesce(hist2("date"), hist1("date")).alias("date"),
(coalesce(hist2("qte"), lit(0)) + coalesce(hist1("qte"), lit(0))).alias("qte"),
(coalesce(hist2("ca"), lit(0)) + coalesce(hist1("ca"), lit(0))).alias("ca"))
.orderBy("pos_id", "article_id")
This doesn't work when hist1 is empty. Any help please?
Thanks a lot
Not sure if I understood correctly, but if the problem is that sometimes the old dataframe (hist1) is empty and that makes the join crash, something you can try is this:
import scala.util.{Failure, Success, Try}

val checkHist1Empty = Try(hist1.first)
val df = checkHist1Empty match {
case Success(df) => {
hist2.join(hist1, Seq("article_id", "pos_id"), "left")
.select($"pos_id", $"article_id",
coalesce(hist2("date"), hist1("date")).alias("date"),
(coalesce(hist2("qte"), lit(0)) + coalesce(hist1("qte"), lit(0))).alias("qte"),
(coalesce(hist2("ca"), lit(0)) + coalesce(hist1("ca"), lit(0))).alias("ca"))
.orderBy("pos_id", "article_id")
}
case Failure(e) => {
hist2.select($"pos_id", $"article_id",
coalesce(hist2("date")).alias("date"),
coalesce(hist2("qte"), lit(0)).alias("qte"),
coalesce(hist2("ca"), lit(0)).alias("ca"))
.orderBy("pos_id", "article_id")
}
}
This basically checks whether hist1 is empty before performing the join. If it is empty, it builds the result with the same logic applied only to hist2. If it contains data, it applies the logic you already had, which you said works.
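A lighter emptiness check than Try(hist1.first) is head(1), which avoids exception-driven control flow (a sketch):
// head(1) returns an empty Array instead of throwing when there are no rows
val hist1IsEmpty = hist1.head(1).isEmpty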
Instead of doing a join, why don't you union the two dataframes and then group by (pos_id, article_id), aggregating qte and ca with sum and date with max?
val df3 = df1.union(df2) // unionAll is deprecated since Spark 2.0
val df4 = df3.groupBy("pos_id", "article_id")
  .agg(max("date").as("date"), sum("qte").as("qte"), sum("ca").as("ca"))
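This also covers the question's empty case without special handling, since a union with an empty dataframe of the same schema is a no-op. A quick sketch of that scenario:
// even when the old dataframe has no rows, the pipeline returns df2's rows unchanged
val df5 = df1.limit(0).union(df2)
  .groupBy("pos_id", "article_id")
  .agg(max("date").as("date"), sum("qte").as("qte"), sum("ca").as("ca"))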