Transform sequence of strings to join columns - scala

I have the following Sequence and DataFrames:
df1.select("link1", "link2").show
+-----+-----+
|link1|link2|
+-----+-----+
| 1| 1|
| 2| 1|
| 2| 1|
| 3| 1|
| 5| 2|
+-----+-----+
df2.select("link1_2", "link2_2").show
+-------+-------+
|link1_2|link2_2|
+-------+-------+
| 2| 1|
| 2| 4|
| 4| 1|
| 5| 2|
| 3| 4|
+-------+-------+
val col_names = Seq("link1", "link2")
I want to create the following join:
df1.join(df2, 'link1 === 'link1_2 && 'link2 === 'link2_2)
without hard-coding the join columns. I basically need a way to do the following transformation:
Seq("str1", "str2", ...) -> 'str1 === 'str1_2 && 'str2 === 'str2_2 && ...
I have tried the following approach which doesn't seem to work:
df1.join(df2, col_names map (str: String => col(str) === col(str + "_2")).foldLeft(true)(_ && _))
Does anybody know how to write the above transformation?

There is no need to traverse the column list twice. Just use foldLeft as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._
val df1 = Seq(
(1, 1), (2, 1), (2, 1), (3, 1), (5, 2)
).toDF("c1", "c2")
val df2 = Seq(
(2, 1), (2, 4), (4, 1), (5, 2), (3, 4)
).toDF("c1_2", "c2_2")
val cols = Seq("c1", "c2")
df1.
join(df2, cols.foldLeft(lit(true))((cond, c) => cond && col(c) === col(c + "_2"))).
show
//+---+---+----+----+
//| c1| c2|c1_2|c2_2|
//+---+---+----+----+
//| 2| 1| 2| 1|
//| 2| 1| 2| 1|
//| 5| 2| 5| 2|
//+---+---+----+----+
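The attempt in the question fails because foldLeft(true) starts from a Scala Boolean, which has no && that accepts a Column. As a minimal sketch, the map-based version can be kept by combining the per-column conditions with reduce instead. The local SparkSession and the names left, right, joinCols and joinCond are assumptions made for illustration; in spark-shell the session already exists as spark.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: a local session for a standalone run.
val spark = SparkSession.builder().master("local[1]").appName("join-cond-sketch").getOrCreate()
import spark.implicits._

val left = Seq((1, 1), (2, 1), (2, 1), (3, 1), (5, 2)).toDF("c1", "c2")
val right = Seq((2, 1), (2, 4), (4, 1), (5, 2), (3, 4)).toDF("c1_2", "c2_2")
val joinCols = Seq("c1", "c2")

// One equality per column name, AND-ed together into a single Column.
// reduce throws on an empty Seq, so joinCols must be non-empty here.
val joinCond: Column = joinCols.map(c => col(c) === col(c + "_2")).reduce(_ && _)

val joined = left.join(right, joinCond)
joined.show()
```

Compared with foldLeft(lit(true)), reduce avoids the redundant leading true in the condition, but foldLeft is the safer choice when the column list may be empty.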

What is the Scala Spark performance of groupBy vs pivot?

I need to pivot a Spark DataFrame and apply a different aggregation function depending on the value of the pivot column. I am using this other question on SO as my starting point.
Let's take the following as a starting point:
scala> val data = Seq((1, "k1", "measureA", 2), (1, "k1", "measureA", 4), (1, "k1", "measureB", 5), (1, "k1", "measureB", 7), (1, "k1", "measureC", 7), (1, "k1", "measureC", 1), (2, "k1", "measureB", 8), (2, "k1", "measureC", 9), (2, "k2", "measureA", 5), (2, "k2", "measureC", 5), (2, "k2", "measureC", 8))
data: Seq[(Int, String, String, Int)] = List((1,k1,measureA,2), (1,k1,measureA,4), (1,k1,measureB,5), (1,k1,measureB,7), (1,k1,measureC,7), (1,k1,measureC,1), (2,k1,measureB,8), (2,k1,measureC,9), (2,k2,measureA,5), (2,k2,measureC,5), (2,k2,measureC,8))
scala> val df = data.toDF("ts","key","measure_type","value")
df: org.apache.spark.sql.DataFrame = [ts: int, key: string ... 2 more fields]
scala> df.show
+---+---+------------+-----+
| ts|key|measure_type|value|
+---+---+------------+-----+
| 1| k1| measureA| 2|
| 1| k1| measureA| 4|
| 1| k1| measureB| 5|
| 1| k1| measureB| 7|
| 1| k1| measureC| 7|
| 1| k1| measureC| 1|
| 2| k1| measureB| 8|
| 2| k1| measureC| 9|
| 2| k2| measureA| 5|
| 2| k2| measureC| 5|
| 2| k2| measureC| 8|
+---+---+------------+-----+
Which performs better? A groupBy + agg:
val ddf = df.groupBy("ts", "key").agg(
sum(when(col("measure_type") === "measureA",col("value"))).as("measureA"),
avg(when(col("measure_type") === "measureB",col("value"))).as("measureB"),
max(when(col("measure_type") === "measureC",col("value"))).as("measureC"))
ddf.show
+---+---+--------+--------+--------+
| ts|key|measureA|measureB|measureC|
+---+---+--------+--------+--------+
| 1| k1| 6| 6.0| 7|
| 2| k1| null| 8.0| 9|
| 2| k2| 5| null| 8|
+---+---+--------+--------+--------+
Or pivot + agg:
val listA = Seq("measureA")
val listB = Seq("measureB")
val listC = Seq("measureC")
val ddf = df.groupBy("ts", "key").pivot(col("measure_type"), Seq("measureA", "measureB", "measureC")).agg(
sum(when(col("measure_type").isInCollection(listA),col("value"))).as("measureA"),
avg(when(col("measure_type").isInCollection(listB),col("value"))).as("measureB"),
max(when(col("measure_type").isInCollection(listC),col("value"))).as("measureC"))
ddf.show()
+---+---+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
| ts|key|measureA_measureA|measureA_measureB|measureA_measureC|measureB_measureA|measureB_measureB|measureB_measureC|measureC_measureA|measureC_measureB|measureC_measureC|
+---+---+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
| 2| k2| 5| null| null| null| null| null| null| null| 8|
| 2| k1| null| null| null| null| 8.0| null| null| null| 9|
| 1| k1| 6| null| null| null| 6.0| null| null| null| 7|
+---+---+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
I am aware that the second DataFrame is different: it contains one column for every combination of the pivot values passed in the Seq() and the aggregation functions (SQL CASEs, because of when()) I chose, so 3 * 3 = 9 columns. But if you drop the columns of this second DataFrame that contain only nulls, the result is the same.
I am also wondering whether I am doing something wrong in the second approach, or whether there is a better way to name the columns so that the all-null columns are avoided from the start.
val ddf = df.groupBy("ts", "key").pivot(col("measure_type"), Seq("measureA", "measureB", "measureC")).agg(
sum(when(col("measure_type").isInCollection(listA),col("value"))),
avg(when(col("measure_type").isInCollection(listB),col("value"))),
max(when(col("measure_type").isInCollection(listC),col("value"))))
ddf:org.apache.spark.sql.DataFrame
ts:integer
key:string
measureA_sum(CASE WHEN (measure_type IN (measureA)) THEN value END):long
measureA_avg(CASE WHEN (measure_type IN (measureB)) THEN value END):double
measureA_max(CASE WHEN (measure_type IN (measureC)) THEN value END):integer
measureB_sum(CASE WHEN (measure_type IN (measureA)) THEN value END):long
measureB_avg(CASE WHEN (measure_type IN (measureB)) THEN value END):double
measureB_max(CASE WHEN (measure_type IN (measureC)) THEN value END):integer
measureC_sum(CASE WHEN (measure_type IN (measureA)) THEN value END):long
measureC_avg(CASE WHEN (measure_type IN (measureB)) THEN value END):double
measureC_max(CASE WHEN (measure_type IN (measureC)) THEN value END):integer
I have not posted the ddf.show output because the headers are very verbose. The result is the same as in the pivot + agg example, only with the headers listed above.
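On the naming part of the question, here is a minimal sketch under stated assumptions rather than a definitive answer. pivot already restricts each output column to one measure_type, so the when() filters are not needed inside agg, and short aliases give predictable names of the form <pivotValue>_<alias>. The wanted columns can then be selected and renamed. The aliases "sum", "avg" and "max", the names pivoted and wanted, and the local SparkSession are assumptions for illustration. This says nothing about which variant is faster.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, max, sum}

// Assumption: a local session for a standalone run.
val spark = SparkSession.builder().master("local[1]").appName("pivot-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "k1", "measureA", 2), (1, "k1", "measureA", 4), (1, "k1", "measureB", 5),
  (1, "k1", "measureB", 7), (1, "k1", "measureC", 7), (1, "k1", "measureC", 1),
  (2, "k1", "measureB", 8), (2, "k1", "measureC", 9), (2, "k2", "measureA", 5),
  (2, "k2", "measureC", 5), (2, "k2", "measureC", 8)
).toDF("ts", "key", "measure_type", "value")

// All three aggregates are still computed for every pivot value (9 columns),
// but the column names are now short: measureA_sum, measureA_avg, ...
val pivoted = df.groupBy("ts", "key")
  .pivot("measure_type", Seq("measureA", "measureB", "measureC"))
  .agg(sum("value").as("sum"), avg("value").as("avg"), max("value").as("max"))

// Keep only the combination wanted for each measure and rename it.
val wanted = pivoted.select(
  col("ts"), col("key"),
  col("measureA_sum").as("measureA"),
  col("measureB_avg").as("measureB"),
  col("measureC_max").as("measureC"))

wanted.show()
```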

Group by column "grp" and compress DataFrame - (take last not null value for each column ordering by column "ord")

Assuming I have the following DataFrame:
+---+--------+---+----+----+
|grp|null_col|ord|col1|col2|
+---+--------+---+----+----+
| 1| null| 3|null| 11|
| 2| null| 2| xxx| 22|
| 1| null| 1| yyy|null|
| 2| null| 7|null| 33|
| 1| null| 12|null|null|
| 2| null| 19|null| 77|
| 1| null| 10| s13|null|
| 2| null| 11| a23|null|
+---+--------+---+----+----+
Here is the same sample DF with comments, sorted by grp and ord:
scala> df.orderBy("grp", "ord").show
+---+--------+---+----+----+
|grp|null_col|ord|col1|col2|
+---+--------+---+----+----+
| 1| null| 1| yyy|null|
| 1| null| 3|null| 11| # grp:1 - last value for `col2` (11)
| 1| null| 10| s13|null| # grp:1 - last value for `col1` (s13)
| 1| null| 12|null|null| # grp:1 - last values for `null_col`, `ord`
| 2| null| 2| xxx| 22|
| 2| null| 7|null| 33|
| 2| null| 11| a23|null| # grp:2 - last value for `col1` (a23)
| 2| null| 19|null| 77| # grp:2 - last values for `null_col`, `ord`, `col2`
+---+--------+---+----+----+
I would like to compress it, i.e. group it by column "grp" and, for each group, sort the rows by the "ord" column and take the last non-null value in each column (if there is one).
+---+--------+---+----+----+
|grp|null_col|ord|col1|col2|
+---+--------+---+----+----+
| 1| null| 12| s13| 11|
| 2| null| 19| a23| 77|
+---+--------+---+----+----+
I've seen the following similar questions:
How to select the first row of each group?
How to find first non-null values in groups? (secondary sorting using dataset api)
but my real DataFrame has over 250 columns, so I need a solution where I don't have to specify all the columns explicitly.
I can't wrap my head around it...
MCVE: how to create a sample DataFrame:
create a local file "/tmp/data.txt" and copy and paste the contents of the DataFrame into it (as posted above)
define function readSparkOutput():
parse "/tmp/data.txt" to DataFrame:
val df = readSparkOutput("file:///tmp/data.txt")
UPDATE: I think it should be similar to the following SQL:
SELECT
grp, ord, null_col, col1, col2
FROM (
SELECT
grp,
ord,
FIRST(null_col) OVER (PARTITION BY grp ORDER BY ord DESC) as null_col,
FIRST(col1) OVER (PARTITION BY grp ORDER BY ord DESC) as col1,
FIRST(col2) OVER (PARTITION BY grp ORDER BY ord DESC) as col2,
ROW_NUMBER() OVER (PARTITION BY grp ORDER BY ord DESC) as rn
FROM table_name) as v
WHERE v.rn = 1;
How can we dynamically generate such a Spark query?
I tried the following simplified approach:
import org.apache.spark.sql.expressions.Window
val win = Window
.partitionBy("grp")
.orderBy($"ord".desc)
val cols = df.columns.map(c => first(c, ignoreNulls=true).over(win).as(c))
which produces:
scala> cols
res23: Array[org.apache.spark.sql.Column] = Array(first(grp, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `grp`, first(null_col, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `null_col`, first(ord, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `ord`, first(col1, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `col1`, first(col2, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `col2`)
but I couldn't pass it to df.select:
scala> df.select(cols.head, cols.tail: _*).show
<console>:34: error: no `: _*' annotation allowed here
(such annotations are only allowed in arguments to *-parameters)
df.select(cols.head, cols.tail: _*).show
Another attempt:
scala> df.select(cols.map(col): _*).show
<console>:34: error: type mismatch;
found : String => org.apache.spark.sql.Column
required: org.apache.spark.sql.Column => ?
df.select(cols.map(col): _*).show
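Both errors come from how select is overloaded: there is select(cols: Column*) and select(col: String, cols: String*), but no (Column, Column*) variant, and col expects a String rather than a Column. Since cols is already an Array[Column], it can be passed directly as select(cols: _*). The sketch below does that. It also widens the window frame to the whole partition so that every row of a group carries the same values and distinct collapses them. The explicit frame, the names sample, fullWin, windowed and compressed, and the local SparkSession are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first, lit}

// Assumption: a local session for a standalone run.
val spark = SparkSession.builder().master("local[1]").appName("compress-sketch").getOrCreate()
import spark.implicits._

val sample = Seq[(Int, Int, Option[String], Option[Int])](
  (1, 3, None, Some(11)), (2, 2, Some("xxx"), Some(22)),
  (1, 1, Some("yyy"), None), (2, 7, None, Some(33)),
  (1, 12, None, None), (2, 19, None, Some(77)),
  (1, 10, Some("s13"), None), (2, 11, Some("a23"), None)
).toDF("grp", "ord", "col1", "col2")
  .withColumn("null_col", lit(null).cast("string"))

// Frame covers the whole group, ordered from highest to lowest "ord".
val fullWin = Window.partitionBy("grp").orderBy(col("ord").desc)
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

// One window expression per column, generated from the schema.
val windowed = sample.columns.map(c => first(col(c), ignoreNulls = true).over(fullWin).as(c))

// An Array[Column] goes straight into select via : _*
val compressed = sample.select(windowed: _*).distinct()
compressed.orderBy("grp").show()
```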
Consider the following approach, which applies the window function last(c, ignoreNulls=true), ordered by "ord" within each "grp", to each of the selected columns, followed by a groupBy("grp") that takes the first() of each column via agg(colFcnMap):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._ // needed for toDF outside spark-shell
val df0 = Seq(
(1, 3, None, Some(11)),
(2, 2, Some("aaa"), Some(22)),
(1, 1, Some("s12"), None),
(2, 7, None, Some(33)),
(1, 12, None, None),
(2, 19, None, Some(77)),
(1, 10, Some("s13"), None),
(2, 11, Some("a23"), None)
).toDF("grp", "ord", "col1", "col2")
val df = df0.withColumn("null_col", lit(null))
df.orderBy("grp", "ord").show
// +---+---+----+----+--------+
// |grp|ord|col1|col2|null_col|
// +---+---+----+----+--------+
// | 1| 1| s12|null| null|
// | 1| 3|null| 11| null|
// | 1| 10| s13|null| null|
// | 1| 12|null|null| null|
// | 2| 2| aaa| 22| null|
// | 2| 7|null| 33| null|
// | 2| 11| a23|null| null|
// | 2| 19|null| 77| null|
// +---+---+----+----+--------+
val win = Window.partitionBy("grp").orderBy("ord").
rowsBetween(0, Window.unboundedFollowing)
val nonAggCols = Array("grp")
val cols = df.columns.diff(nonAggCols) // Columns to be aggregated
val colFcnMap = cols.zip(Array.fill(cols.size)("first")).toMap
// colFcnMap: scala.collection.immutable.Map[String,String] =
// Map(ord -> first, col1 -> first, col2 -> first, null_col -> first)
cols.foldLeft(df)((acc, c) =>
acc.withColumn(c, last(c, ignoreNulls=true).over(win))
).
groupBy("grp").agg(colFcnMap).
select(col("grp") :: colFcnMap.toList.map{case (c, f) => col(s"$f($c)").as(c)}: _*).
show
// +---+---+----+----+--------+
// |grp|ord|col1|col2|null_col|
// +---+---+----+----+--------+
// | 1| 12| s13| 11| null|
// | 2| 19| a23| 77| null|
// +---+---+----+----+--------+
Note that the final select is for stripping the function name (in this case first()) from the aggregated column names.
I have worked something out. Here is the code and output:
import org.apache.spark.sql.functions._
import spark.implicits._
val df0 = Seq(
(1, 3, None, Some(11)),
(2, 2, Some("aaa"), Some(22)),
(1, 1, Some("s12"), None),
(2, 7, None, Some(33)),
(1, 12, None, None),
(2, 19, None, Some(77)),
(1, 10, Some("s13"), None),
(2, 11, Some("a23"), None)
).toDF("grp", "ord", "col1", "col2")
df0.show()
//+---+---+----+----+
//|grp|ord|col1|col2|
//+---+---+----+----+
//| 1| 3|null| 11|
//| 2| 2| aaa| 22|
//| 1| 1| s12|null|
//| 2| 7|null| 33|
//| 1| 12|null|null|
//| 2| 19|null| 77|
//| 1| 10| s13|null|
//| 2| 11| a23|null|
//+---+---+----+----+
Order the data by the first two columns:
val df1 = df0.select("grp", "ord", "col1", "col2").orderBy("grp", "ord")
df1.show()
//+---+---+----+----+
//|grp|ord|col1|col2|
//+---+---+----+----+
//| 1| 1| s12|null|
//| 1| 3|null| 11|
//| 1| 10| s13|null|
//| 1| 12|null|null|
//| 2| 2| aaa| 22|
//| 2| 7|null| 33|
//| 2| 11| a23|null|
//| 2| 19|null| 77|
//+---+---+----+----+
val df2 = df1.groupBy("grp").agg(max("ord").alias("ord"),collect_set("col1").alias("col1"),collect_set("col2").alias("col2"))
val df3 = df2.withColumn("new_col1",$"col1".apply(size($"col1").minus(1))).withColumn("new_col2",$"col2".apply(size($"col2").minus(1)))
df3.show()
//+---+---+----------+------------+--------+--------+
//|grp|ord| col1| col2|new_col1|new_col2|
//+---+---+----------+------------+--------+--------+
//| 1| 12|[s12, s13]| [11]| s13| 11|
//| 2| 19|[aaa, a23]|[33, 22, 77]| a23| 77|
//+---+---+----------+------------+--------+--------+
You can drop the columns you don't need by using .drop("column_name")
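One caveat with the collect_set version: collect_set does not guarantee element order, so its last element is not guaranteed to be the value with the highest ord. As a hedged alternative (a different technique from the one above), the sketch below takes max over struct(ord, value) restricted to non-null values, which picks the value belonging to the highest ord, and generates the expressions from the column list. It assumes ord is unique within a group. The names valueCols, lastNonNull and compressed, and the local SparkSession, are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max, struct, when}

// Assumption: a local session for a standalone run.
val spark = SparkSession.builder().master("local[1]").appName("struct-max-sketch").getOrCreate()
import spark.implicits._

val df0 = Seq[(Int, Int, Option[String], Option[Int])](
  (1, 3, None, Some(11)), (2, 2, Some("aaa"), Some(22)),
  (1, 1, Some("s12"), None), (2, 7, None, Some(33)),
  (1, 12, None, None), (2, 19, None, Some(77)),
  (1, 10, Some("s13"), None), (2, 11, Some("a23"), None)
).toDF("grp", "ord", "col1", "col2")

// Every column except the grouping and ordering columns.
val valueCols = df0.columns.filterNot(Set("grp", "ord"))

// Structs compare field by field, so max picks the struct with the highest
// "ord"; rows where the value is null yield a null struct and are ignored.
val lastNonNull = valueCols.map { c =>
  max(when(col(c).isNotNull, struct(col("ord"), col(c)))).getField(c).as(c)
}

val compressed = df0.groupBy("grp").agg(max("ord").as("ord"), lastNonNull: _*)
compressed.orderBy("grp").show()
```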
So here we are grouping by a and selecting the max of all other columns in the group:
scala> val df = List((1,2,11), (1,1,1), (2,1,4), (2,3,5)).toDF("a", "b", "c")
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
scala> val aggCols = df.schema.map(_.name).filter(_ != "a").map(colName => max(col(colName)).alias(s"max_$colName"))
aggCols: Seq[org.apache.spark.sql.Column] = List(max(b) AS `max_b`, max(c) AS `max_c`)
scala> df.groupBy(col("a")).agg(aggCols.head, aggCols.tail: _*)
res0: org.apache.spark.sql.DataFrame = [a: int, max_b: int ... 1 more field]
I'd go with the same approach as @LeoC, but I believe there is no need to manipulate column names as strings, so I would go with a more Spark SQL-like answer.
import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first, last, when}
val win = Window.partitionBy("grp").orderBy(col("ord")).rowsBetween(0, Window.unboundedFollowing)
// In case there is more than one group column
val nonAggCols = Seq("grp")
// Select columns to aggregate on
val cols: Seq[String] = df.columns.diff(nonAggCols).toSeq
// Map over the selection and apply the aggregate function
val aggregations: Seq[Column] = cols.map(c => first(col(c), ignoreNulls = true).as(c))
// I'd rather cache the following step as it might get expensive
val step1 = cols.foldLeft(df)((acc, c) => acc.withColumn(c, last(col(c), ignoreNulls = true).over(win))).cache
// Finally we can aggregate our results as follows
val results = step1.groupBy(nonAggCols.head, nonAggCols.tail: _*).agg(aggregations.head, aggregations.tail: _*)
results.show
// +---+--------+---+----+----+
// |grp|null_col|ord|col1|col2|
// +---+--------+---+----+----+
// | 1| null| 12| s13| 11|
// | 2| null| 19| a23| 77|
// +---+--------+---+----+----+
I hope this helps.
EDIT: The reason you are not getting the same results is that the reader you are using isn't correct.
It interprets null from the file as a string rather than as a null, i.e.:
scala> df.filter('col1.isNotNull).show
// +---+--------+---+----+----+
// |grp|null_col|ord|col1|col2|
// +---+--------+---+----+----+
// | 1| null| 3|null| 11|
// | 2| null| 2| xxx| 22|
// | 1| null| 1| yyy|null|
// | 2| null| 7|null| 33|
// | 1| null| 12|null|null|
// | 2| null| 19|null| 77|
// | 1| null| 10| s13|null|
// | 2| null| 11| a23|null|
// +---+--------+---+----+----+
Here is my version of readSparkOutput :
def readSparkOutput(filePath: String): org.apache.spark.sql.DataFrame = {
val step1 = spark.read
.option("header", "true")
.option("inferSchema", "true")
.option("delimiter", "|")
.option("parserLib", "UNIVOCITY")
.option("ignoreLeadingWhiteSpace", "true")
.option("ignoreTrailingWhiteSpace", "true")
.option("comment", "+")
.csv(filePath)
val step2 = step1.select(step1.columns.filterNot(_.startsWith("_c")).map(step1(_)): _*)
val columns = step2.columns
// Keep a value only if it is neither a real null nor the literal string "null"
columns.foldLeft(step2)((acc, c) => acc.withColumn(c, when(col(c) =!= "null" && col(c).isNotNull, col(c))))
}
Here is your answer (and hopefully my bounty!!!)
scala> val df = spark.sparkContext.parallelize(List(
| (1,null.asInstanceOf[String],3,null.asInstanceOf[String],new Integer(11)),
| (2,null.asInstanceOf[String],2,new String("xxx"),new Integer(22)),
| (1,null.asInstanceOf[String],1,new String("yyy"),null.asInstanceOf[Integer]),
| (2,null.asInstanceOf[String],7,null.asInstanceOf[String],new Integer(33)),
| (1,null.asInstanceOf[String],12,null.asInstanceOf[String],null.asInstanceOf[Integer]),
| (2,null.asInstanceOf[String],19,null.asInstanceOf[String],new Integer(77)),
| (1,null.asInstanceOf[String],10,new String("s13"),null.asInstanceOf[Integer]),
| (2,null.asInstanceOf[String],11,new String("a23"),null.asInstanceOf[Integer]))).toDF("grp","null_col","ord","col1","col2")
df: org.apache.spark.sql.DataFrame = [grp: int, null_col: string ... 3 more fields]
scala> df.show
+---+--------+---+----+----+
|grp|null_col|ord|col1|col2|
+---+--------+---+----+----+
| 1| null| 3|null| 11|
| 2| null| 2| xxx| 22|
| 1| null| 1| yyy|null|
| 2| null| 7|null| 33|
| 1| null| 12|null|null|
| 2| null| 19|null| 77|
| 1| null| 10| s13|null|
| 2| null| 11| a23|null|
+---+--------+---+----+----+
//Create window specification
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val win = Window.partitionBy("grp").orderBy($"ord".desc)
win: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@71878833
//Use foldLeft with first over window specification over all columns and take distinct
scala> val result = df.columns.foldLeft(df)((df, colName) => df.withColumn(colName, first(colName, ignoreNulls=true).over(win).as(colName))).distinct
result: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [grp: int, null_col: string ... 3 more fields]
scala> result.show
+---+--------+---+----+----+
|grp|null_col|ord|col1|col2|
+---+--------+---+----+----+
| 1| null| 12| s13| 11|
| 2| null| 19| a23| 77|
+---+--------+---+----+----+
Hope this helps.

Apache Spark - Scala API - Aggregate on sequentially increasing key

I have a data frame that looks something like this:
val df = sc.parallelize(Seq(
(3,1,"A"),(3,2,"B"),(3,3,"C"),
(2,1,"D"),(2,2,"E"),
(3,1,"F"),(3,2,"G"),(3,3,"G"),
(2,1,"X"),(2,2,"X")
)).toDF("TotalN", "N", "String")
+------+---+------+
|TotalN| N|String|
+------+---+------+
| 3| 1| A|
| 3| 2| B|
| 3| 3| C|
| 2| 1| D|
| 2| 2| E|
| 3| 1| F|
| 3| 2| G|
| 3| 3| G|
| 2| 1| X|
| 2| 2| X|
+------+---+------+
I need to aggregate the strings by concatenating them together based on the TotalN and the sequentially increasing ID (N). The problem is there is not a unique ID for each aggregation I can group by. So, I need to do something like "for each row look at the TotalN, loop through the next N rows and concatenate, then reset".
+------+------+
|TotalN|String|
+------+------+
| 3| ABC|
| 2| DE|
| 3| FGG|
| 2| XX|
+------+------+
Any pointers much appreciated.
Using Spark 2.3.1 and the Scala API.
Try this:
val df = spark.sparkContext.parallelize(Seq(
(3, 1, "A"), (3, 2, "B"), (3, 3, "C"),
(2, 1, "D"), (2, 2, "E"),
(3, 1, "F"), (3, 2, "G"), (3, 3, "G"),
(2, 1, "X"), (2, 2, "X")
)).toDF("TotalN", "N", "String")
df.createOrReplaceTempView("data")
val sqlDF = spark.sql(
"""
| SELECT TotalN d, N, String, ROW_NUMBER() over (order by TotalN) as rowNum
| FROM data
""".stripMargin)
sqlDF.withColumn("key", $"N" - $"rowNum")
.groupBy("key").agg(collect_list('String).as("texts")).show()
The solution is to calculate a grouping variable using the row_number function, which can then be used in a later groupBy.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val w = Window.orderBy("TotalN")
df.withColumn("GeneratedID", $"N" - row_number.over(w)).show
+------+---+------+-----------+
|TotalN| N|String|GeneratedID|
+------+---+------+-----------+
| 2| 1| D| 0|
| 2| 2| E| 0|
| 2| 1| X| -2|
| 2| 2| X| -2|
| 3| 1| A| -4|
| 3| 2| B| -4|
| 3| 3| C| -4|
| 3| 1| F| -7|
| 3| 2| G| -7|
| 3| 3| G| -7|
+------+---+------+-----------+
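The grouping rule from the question (start a new group whenever N resets to 1, then concatenate until the next reset) can be illustrated on a plain local collection. This is only a sketch of the logic, not a distributed solution: it assumes the rows are available in their original order, and the case class Rec and the names rows and groups are hypothetical.

```scala
// Hypothetical record type mirroring the (TotalN, N, String) columns.
final case class Rec(totalN: Int, n: Int, s: String)

val rows = Seq(
  Rec(3, 1, "A"), Rec(3, 2, "B"), Rec(3, 3, "C"),
  Rec(2, 1, "D"), Rec(2, 2, "E"),
  Rec(3, 1, "F"), Rec(3, 2, "G"), Rec(3, 3, "G"),
  Rec(2, 1, "X"), Rec(2, 2, "X")
)

// Walk the rows in order: N == 1 opens a new group, any other N
// appends its string to the group opened most recently.
val groups = rows.foldLeft(Vector.empty[(Int, String)]) { (acc, r) =>
  if (r.n == 1 || acc.isEmpty) acc :+ ((r.totalN, r.s))
  else acc.init :+ ((acc.last._1, acc.last._2 + r.s))
}

groups.foreach { case (totalN, s) => println(s"$totalN $s") }
```

The N - row_number key plays the same role in Spark: it is constant within a run of consecutive N values, provided the row numbering keeps each run contiguous and in order.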

What is the right way to join these 2 Spark DataFrames?

Let's assume I have 2 spark DataFrames:
val addStuffDf = Seq(
("A", "2018-03-22", 5),
("A", "2018-03-24", 1),
("B", "2018-03-24", 3))
.toDF("user", "dt", "count")
val removedStuffDf = Seq(
("C", "2018-03-25", 10),
("A", "2018-03-24", 5),
("B", "2018-03-25", 1)
).toDF("user", "dt", "count")
and in the end I want to get a single DataFrame with summary statistics like this (ordering doesn't actually matter):
+----+----------+-----+-------+
|user| dt|added|removed|
+----+----------+-----+-------+
| A|2018-03-22| 5| 0|
| A|2018-03-24| 1| 5|
| B|2018-03-24| 3| 0|
| B|2018-03-25| 0| 1|
| C|2018-03-25| 0| 10|
+----+----------+-----+-------+
It's quite clear that I can simply rename the "count" columns at "step 0", so to have dataframes df1 and df2
val df1 = addStuffDf.withColumnRenamed("count", "added")
df1.show()
+----+----------+-----+
|user| dt|added|
+----+----------+-----+
| A|2018-03-22| 5|
| A|2018-03-24| 1|
| B|2018-03-24| 3|
+----+----------+-----+
val df2 = removedStuffDf.withColumnRenamed("count", "removed")
df2.show()
+----+----------+-------+
|user| dt|removed|
+----+----------+-------+
| C|2018-03-25| 10|
| A|2018-03-24| 5|
| B|2018-03-25| 1|
+----+----------+-------+
But now I'm failing to define "step 1" - namely, to determine the transform that would zip together df1 and df2.
From a logical standpoint, a full_outer join brings all the rows I need into a single DF, but then I need to merge the duplicated columns somehow:
df1.as('d1)
.join(df2.as('d2),
($"d1.user"===$"d2.user" && $"d1.dt"===$"d2.dt"),
"full_outer")
.show()
+----+----------+-----+----+----------+-------+
|user| dt|added|user| dt|removed|
+----+----------+-----+----+----------+-------+
|null| null| null| C|2018-03-25| 10|
|null| null| null| B|2018-03-25| 1|
| B|2018-03-24| 3|null| null| null|
| A|2018-03-22| 5|null| null| null|
| A|2018-03-24| 1| A|2018-03-24| 5|
+----+----------+-----+----+----------+-------+
How can I merge these user and dt columns together? And, overall - am I using the correct approach to solve my problem or is there a more straightforward/efficient solution?
Since the columns to be joined for the two DataFrames have matching names, using Seq("user", "dt") for the join conditions will result in the merged table you want:
val addStuffDf = Seq(
("A", "2018-03-22", 5),
("A", "2018-03-24", 1),
("B", "2018-03-24", 3)
).toDF("user", "dt", "count")
val removedStuffDf = Seq(
("C", "2018-03-25", 10),
("A", "2018-03-24", 5),
("B", "2018-03-25", 1)
).toDF("user", "dt", "count")
val df1 = addStuffDf.withColumnRenamed("count", "added")
val df2 = removedStuffDf.withColumnRenamed("count", "removed")
df1.as('d1).join(df2.as('d2), Seq("user", "dt"), "full_outer").
na.fill(0).
show
// +----+----------+-----+-------+
// |user| dt|added|removed|
// +----+----------+-----+-------+
// | C|2018-03-25| 0| 10|
// | B|2018-03-25| 0| 1|
// | B|2018-03-24| 3| 0|
// | A|2018-03-22| 5| 0|
// | A|2018-03-24| 1| 5|
// +----+----------+-----+-------+
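When the key columns cannot be joined by name (for example because they are named differently on each side), the duplicated columns from an expression-based full outer join can be merged with coalesce. The sketch below applies that to the same data. The names addedDf, removedDf and merged, and the local SparkSession, are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit}

// Assumption: a local session for a standalone run.
val spark = SparkSession.builder().master("local[1]").appName("outer-join-sketch").getOrCreate()
import spark.implicits._

val addedDf = Seq(
  ("A", "2018-03-22", 5), ("A", "2018-03-24", 1), ("B", "2018-03-24", 3)
).toDF("user", "dt", "added")
val removedDf = Seq(
  ("C", "2018-03-25", 10), ("A", "2018-03-24", 5), ("B", "2018-03-25", 1)
).toDF("user", "dt", "removed")

val merged = addedDf.as("d1")
  .join(removedDf.as("d2"),
    col("d1.user") === col("d2.user") && col("d1.dt") === col("d2.dt"),
    "full_outer")
  .select(
    // Take the key from whichever side is present.
    coalesce(col("d1.user"), col("d2.user")).as("user"),
    coalesce(col("d1.dt"), col("d2.dt")).as("dt"),
    // A missing count on one side becomes 0.
    coalesce(col("added"), lit(0)).as("added"),
    coalesce(col("removed"), lit(0)).as("removed"))

merged.show()
```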

Aggregating on a derived column in Spark

DF.groupBy("id")
.agg(
sum((when(upper($"col_name") === "TEXT", 1)
.otherwise(0)))
.alias("df_count")
.when($"df_count"> 1, 1)
.otherwise(0)
)
Can I aggregate on the column that was given an alias, i.e. if the sum is greater than one, return 1, else 0?
Thanks in advance.
I think you could wrap another when.otherwise around the sum result:
val df = Seq((1, "a"), (1, "a"), (2, "b"), (3, "a")).toDF("id", "col_name")
df.show
+---+--------+
| id|col_name|
+---+--------+
| 1| a|
| 1| a|
| 2| b|
| 3| a|
+---+--------+
df.groupBy("id").agg(
sum(when(upper($"col_name") === "A", 1).otherwise(0)).alias("df_count")
).show()
+---+--------+
| id|df_count|
+---+--------+
| 1| 2|
| 3| 1|
| 2| 0|
+---+--------+
df.groupBy("id").agg(
when(sum(when(upper($"col_name")==="A", 1).otherwise(0)) > 1, 1).otherwise(0).alias("df_count")
).show()
+---+--------+
| id|df_count|
+---+--------+
| 1| 1|
| 3| 0|
| 2| 0|
+---+--------+
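An alias only becomes a column after agg has finished, so it cannot be referenced inside the same agg call. If the intermediate count is also wanted, a two-step variant is possible, sketched below: aggregate first, then derive the flag from the aliased column with withColumn. The column name flag, the names counted and flagged, and the local SparkSession are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum, upper, when}

// Assumption: a local session for a standalone run.
val spark = SparkSession.builder().master("local[1]").appName("alias-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (1, "a"), (2, "b"), (3, "a")).toDF("id", "col_name")

// Step 1: aggregate and name the result.
val counted = df.groupBy("id").agg(
  sum(when(upper(col("col_name")) === "A", 1).otherwise(0)).alias("df_count"))

// Step 2: the alias is now an ordinary column and can be referenced.
val flagged = counted.withColumn("flag", when(col("df_count") > 1, 1).otherwise(0))

flagged.show()
```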