For the dataframe given below, I want to add a new column that holds a constant value: the sum of the freq column.
+------+----+
|number|freq|
+------+----+
| 8| 1|
| 6| 2|
| 2| 4|
+------+----+
The result should look like:
+------+----+-------+
|number|freq|new_col|
+------+----+-------+
| 8| 1| 7|
| 6| 2| 7|
| 2| 4| 7|
+------+----+-------+
and I want this without using groupBy or agg.
I tried:
var x = sum(df("freq"))
df.withColumn("new_col",lit(x))
or
df.withColumn("new_col",x)
or
df.withColumn("new_col",sum($"freq"))
But none worked.
None of those work because sum(df("freq")) builds an unevaluated aggregate expression, not a value you can wrap in lit. You can try this, but be careful: it uses a single partition:
import spark.implicits._
import org.apache.spark.sql.functions._

val df = Seq(
  (8, 1),
  (6, 2),
  (2, 4)
).toDF("number", "freq")

// An empty over() computes the aggregate across the entire dataframe.
df.withColumn("new_col", sum($"freq").over())
  .show(false)
+------+----+-------+
|number|freq|new_col|
+------+----+-------+
|8 |1 |7 |
|6 |2 |7 |
|2 |4 |7 |
+------+----+-------+
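For reference, the empty over() above is equivalent to an explicit window spec with no partition keys; a minimal sketch:

import org.apache.spark.sql.expressions.Window

// No partition keys: every row falls into a single window (and a single partition).
df.withColumn("new_col", sum($"freq").over(Window.partitionBy()))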
You could use a window over the entire dataframe to do that, but I strongly recommend against it: all the data would have to be shuffled to a single partition, which is terrible for performance.
A simple way to do it, very similar to your first approach, is:
import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{lit, sum}

// Collect the one-row aggregate to the driver, then attach it as a literal.
val Row(x) = df.select(sum($"freq")).head
val new_df = df.withColumn("new_col", lit(x))
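If you would rather keep everything inside Spark instead of collecting to the driver, a crossJoin against the one-row aggregate also works (a sketch of my own, assuming Spark 2.1+ where crossJoin is available):

// The aggregate yields a single row, so the cross join simply
// appends the total to every row of df.
val totals = df.agg(sum($"freq").as("new_col"))
val new_df2 = df.crossJoin(totals)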
I have to fill a column's leading null values with the first non-null value that follows them. This logic applies only to the first run of consecutive null values in the column.
I have a dataframe similar to the one below:
// I replaced null with 0 in the value column
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
(5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
.toDF("value", "col2", "col3")
scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|0 |exA |30 |
|0 |exB |22 |
|0 |exC |19 |
|16 |exD |13 |
|5 |exE |28 |
|6 |exF |26 |
|0 |exG |12 |
|13 |exH |53 |
+-----+----+----+
From this dataframe I expect the result below:
scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|16 |exA |30 | // change 0 to 16 in the value column
|16 |exB |22 | // change 0 to 16 in the value column
|16 |exC |19 | // change 0 to 16 in the value column
|16 |exD |13 |
|5 |exE |28 |
|6 |exF |26 |
|0 |exG |12 | // value should not be changed here
|13 |exH |53 |
+-----+----+----+
Please help me solve this.
You can use a Window function for this purpose:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
              (5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
  .toDF("value", "col2", "col3")

val w = Window.orderBy($"col2".desc)
df.withColumn("Result", last(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
  .orderBy($"col2")
  .show(10)
This will result in:
+-----+----+----+------+
|value|col2|col3|Result|
+-----+----+----+------+
| 0| exA| 30| 16|
| 0| exB| 22| 16|
| 0| exC| 19| 16|
| 16| exD| 13| 16|
| 5| exE| 28| 5|
| 6| exF| 26| 6|
| 0| exG| 12| 13|
| 13| exH| 53| 13|
+-----+----+----+------+
The trailing .orderBy($"col2") is needed only to show the final results in the right order; you can skip it if you don't care about the final ordering.
UPDATE
To get exactly what you need, you have to use slightly more complicated code. The first window carries the first non-zero value forward in col2 order (null until one appears); the second, descending window then copies that value back onto the leading rows, which are the only ones whose intermediate result is still null:
val w = Window.orderBy($"col2")
val w2 = Window.orderBy($"col2".desc)
df.withColumn("IntermediateResult", first(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
.withColumn("Result", when($"IntermediateResult".isNull, last($"IntermediateResult", ignoreNulls = true).over(w2)).otherwise($"value"))
.orderBy($"col2")
.show(10)
+-----+----+----+------------------+------+
|value|col2|col3|IntermediateResult|Result|
+-----+----+----+------------------+------+
| 0| exA| 30| null| 16|
| 0| exB| 22| null| 16|
| 0| exC| 19| null| 16|
| 16| exD| 13| 16| 16|
| 5| exE| 28| 16| 5|
| 6| exF| 26| 16| 6|
| 0| exG| 12| 16| 0|
| 13| exH| 53| 16| 13|
+-----+----+----+------------------+------+
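Note (my addition): both windows here use orderBy without partitionBy, so, as in the first answer above, Spark moves all rows to a single partition to evaluate them. That is fine for small data but costly at scale.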
I think you need to take the first non-null or non-zero value based on col2's order. Please find the script below; I registered the dataframe as a temporary view so I could write SQL against it.
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
              (5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
  .toDF("value", "col2", "col3")

// registerTempTable is deprecated since Spark 2.0; use createOrReplaceTempView.
df.createOrReplaceTempView("table_df")

spark.sql("""
  with cte as (
    select *, row_number() over (order by col2) rno from table_df
  )
  select
    case
      when value = 0 and rno < (select min(rno) from cte where value != 0)
        then (select value from cte where rno = (select min(rno) from cte where value != 0))
      else value
    end as value,
    col2, col3
  from cte
""").show(df.count.toInt, false)
Please let me know if you have any questions.
I added a new column with an incremental id to your DF:
import spark.implicits._
import org.apache.spark.sql.functions._
val df_1 = Seq((0,"exA",30),
(0,"exB",22),
(0,"exC",19),
(16,"exD",13),
(5,"exE",28),
(6,"exF",26),
(0,"exG",12),
(13,"exH",53))
.toDF("value", "col2", "col3")
.withColumn("UniqueID", monotonically_increasing_id)
Filter the DF down to the non-zero values:
val df_2 = df_1.filter("value != 0")
Create a variable limit that points just past the first non-zero row, and a variable nVal holding that first non-zero value:
val limit = df_2.agg(min("UniqueID")).collect().map(_(0)).mkString("").toInt + 1
val nVal = df_1.limit(limit).agg(max("value")).collect().map(_(0)).mkString("").toInt
Finally, rebuild the "value" column conditionally:
val df_4 = df_1.withColumn("value", when(($"UniqueID" < limit), nVal).otherwise($"value"))
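A caveat of mine, not from the original answer: this relies on monotonically_increasing_id producing consecutive ids 0..n-1, which holds for a single input partition but is not guaranteed in general (the ids are only guaranteed increasing and unique). Under that assumption, limit is 4 here and the leading zero rows come out as 16:

df_4.show(false) // the first three rows should now show 16 in "value"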
I have two data sources, both of which have opinions about the current state of the same set of entities. Either data source may contain the most current data, which may or may not be from the current date. For example:
val df1 = Seq((1, "green", "there", "2018-01-19"), (2, "yellow", "there", "2018-01-18"), (4, "yellow", "here", "2018-01-20")).toDF("id", "status", "location", "date")
val df2 = Seq((2, "red", "here", "2018-01-20"), (3, "green", "there", "2018-01-20"), (4, "green", "here", "2018-01-19")).toDF("id", "status", "location", "date")
df1.show
+---+------+--------+----------+
| id|status|location| date|
+---+------+--------+----------+
| 1| green| there|2018-01-19|
| 2|yellow| there|2018-01-18|
| 4|yellow| here|2018-01-20|
+---+------+--------+----------+
df2.show
+---+------+--------+----------+
| id|status|location| date|
+---+------+--------+----------+
| 2| red| here|2018-01-20|
| 3| green| there|2018-01-20|
| 4| green| here|2018-01-19|
+---+------+--------+----------+
I want the output to be the set of most current states for each entity:
+---+------+--------+----------+
| id|status|location| date|
+---+------+--------+----------+
| 1| green| there|2018-01-19|
| 2| red| here|2018-01-20|
| 3| green| there|2018-01-20|
| 4|yellow| here|2018-01-20|
+---+------+--------+----------+
My approach, which seems to work, is to join the two tables and then do a kind of custom coalesce operation based on date:
val joined = df1.join(df2, df1("id") === df2("id"), "outer")
+----+------+--------+----------+----+------+--------+----------+
| id|status|location| date| id|status|location| date|
+----+------+--------+----------+----+------+--------+----------+
| 1| green| there|2018-01-19|null| null| null| null|
|null| null| null| null| 3| green| there|2018-01-20|
| 4|yellow| here|2018-01-20| 4|yellow| here|2018-01-20|
| 2|yellow| there|2018-01-18| 2| red| here|2018-01-20|
+----+------+--------+----------+----+------+--------+----------+
import org.apache.spark.sql.Column

def weirdCoal(name: String): Column =
  when(df1("date") > df2("date") || df2("date").isNull, df1(name)).otherwise(df2(name)) as name

val output = joined.select(df1.columns.map(weirdCoal): _*)
+---+------+--------+----------+
| id|status|location| date|
+---+------+--------+----------+
| 1| green| there|2018-01-19|
| 2| red| here|2018-01-20|
| 3| green| there|2018-01-20|
| 4|yellow| here|2018-01-20|
+---+------+--------+----------+
Which is the output I expect.
I can also see doing this via some kind of union / aggregation approach or with a window that partitions by id and sorts by date and takes the last row.
My question: is there an idiomatic way of doing this?
Yes, it can be done without a join by using Window functions:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

df1.union(df2)
  .withColumn("rank", rank().over(Window.partitionBy($"id").orderBy($"date".desc)))
  .filter($"rank" === 1)
  .drop($"rank")
  .orderBy($"id")
  .show
output:
+---+------+--------+----------+
| id|status|location| date|
+---+------+--------+----------+
| 1| green| there|2018-01-19|
| 2| red| here|2018-01-20|
| 3| green| there|2018-01-20|
| 4|yellow| here|2018-01-20|
+---+------+--------+----------+
The code above partitions the data by id and keeps, within each id, the row with the latest date.
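One caveat of mine, not from the original answer: rank admits ties, so if the same id carried the same date in both sources you would keep both rows. A sketch using row_number instead, which always keeps exactly one row per id:

// row_number assigns a unique 1..n within each partition, so ties on
// date are broken arbitrarily and exactly one row survives per id.
df1.union(df2)
  .withColumn("rn", row_number().over(Window.partitionBy($"id").orderBy($"date".desc)))
  .filter($"rn" === 1)
  .drop("rn")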
I have two DataFrames. One is a MasterList, the other is an InsertList.
MasterList:
+--------+--------+
| ttm_id|audit_id|
+--------+--------+
| 1| 10|
| 15| 10|
+--------+--------+
InsertList:
+--------+--------+
| ttm_id|audit_id|
+--------+--------+
| 1| 10|
| 15| 9|
+--------+--------+
In Scala, how do I join two DataFrames but only append to the new DataFrame records
WHERE MasterList.ttm_id = InsertList.ttm_id AND
MasterList.audit_id != InsertList.audit_id
ExpectedOutput:
+--------+--------+
| ttm_id|audit_id|
+--------+--------+
| 1| 10|
| 15| 10|
| 15| 9|
+--------+--------+
I'd anti-join (NOT IN) on both columns and then union:
val masterList = Seq((1, 10), (15, 10)).toDF("ttm_id", "audit_id")
val insertList = Seq((1, 10), (15, 9)).toDF("ttm_id", "audit_id")
insertList
.join(masterList, Seq("ttm_id", "audit_id"), "leftanti")
.union(masterList)
.show
// +------+--------+
// |ttm_id|audit_id|
// +------+--------+
// | 15| 9|
// | 1| 10|
// | 15| 10|
// +------+--------+
It seems that you want to merge in the rows from the insertList dataframe that are not present in the masterList dataframe. This can be achieved using the except function:
insertList.except(masterList)
Then just use the union function to merge both dataframes:
masterList.union(insertList.except(masterList))
You should get the result you want:
+------+--------+
|ttm_id|audit_id|
+------+--------+
|1 |10 |
|15 |10 |
|15 |9 |
+------+--------+
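One behavioral difference worth noting (my addition): except has DISTINCT semantics, so duplicate rows inside insertList would be collapsed to one, whereas the left-anti join in the other answer preserves them.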
I have a dataframe something like the one below.
+---+---+-----+
|uId| Id| sum |
+---+---+-----+
| 3| 1| 1.0|
| 7| 1| 1.0|
| 1| 2| 3.0|
| 1| 1| 1.0|
| 6| 5| 1.0|
+---+---+-----+
Using the above DataFrame, I want to generate the new DataFrame described below.
The sum column should be recomputed as follows.
For example:
For uid=3 and id=1, my sum column value should be (old sum value * 1 / count of ID(1)), i.e.
1.0*1/3=0.333
For uid=7 and id=1, my sum column value should be (old sum value * 1 / count of ID(1)), i.e.
1.0*1/3=0.333
For uid=1 and id=2, my sum column value should be (old sum value * 1 / count of ID(2)), i.e.
3.0*1/1=3.0
For uid=6 and id=5, my sum column value should be (old sum value * 1 / count of ID(5)), i.e.
1.0*1/1=1.0
My final output should be:
+---+---+---------+
|uId| Id| sum |
+---+---+---------+
| 3| 1| 0.33333|
| 7| 1| 0.33333|
| 1| 2| 3.0 |
| 1| 1| 0.3333 |
| 6| 5| 1.0    |
+---+---+---------+
You can use a Window function to get the count of each id group and then divide the original sum by that count:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val windowSpec = Window.partitionBy("id")

df.withColumn("sum", $"sum" / count("id").over(windowSpec))
You should end up with the final dataframe:
+---+---+------------------+
|uId|Id |sum |
+---+---+------------------+
|3 |1 |0.3333333333333333|
|7 |1 |0.3333333333333333|
|1 |1 |0.3333333333333333|
|6 |5 |1.0 |
|1 |2 |3.0 |
+---+---+------------------+
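For comparison, a sketch of the same computation without window functions, aggregating the counts per Id and joining them back (my alternative, not from the original answer):

// Count rows per Id, join the counts back onto the original rows,
// divide, and drop the helper column.
val counts = df.groupBy("Id").agg(count("Id").as("cnt"))
df.join(counts, "Id")
  .withColumn("sum", $"sum" / $"cnt")
  .drop("cnt")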