I need to check a condition: if ReasonCode is "YES", then use ProcessDate as one of the PARTITION BY columns, otherwise do not.
The equivalent SQL query is below:
SELECT PNum,
       SUM(SIAmt) OVER (PARTITION BY PNum,
                                     ReasonCode,
                                     CASE WHEN ReasonCode = 'YES' THEN ProcessDate ELSE NULL END
                        ORDER BY ProcessDate
                        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS SumAmt
FROM TABLE1
So far I have tried the query below, but I am unable to incorporate the condition
"CASE WHEN ReasonCode = 'YES' THEN ProcessDate ELSE NULL END" in Spark DataFrames:
val df = inputDF.select("PNum")
.withColumn("SumAmt", sum("SIAmt").over(Window.partitionBy("PNum","ReasonCode").orderBy("ProcessDate")))
Input Data:
---------------------------------------
Pnum ReasonCode ProcessDate SIAmt
---------------------------------------
1 No 1/01/2016 200
1 No 2/01/2016 300
1 Yes 3/01/2016 -200
1 Yes 4/01/2016 200
---------------------------------------
Expected Output:
---------------------------------------------
Pnum ReasonCode ProcessDate SIAmt SumAmt
---------------------------------------------
1 No 1/01/2016 200 200
1 No 2/01/2016 300 500
1 Yes 3/01/2016 -200 -200
1 Yes 4/01/2016 200 200
---------------------------------------------
Any suggestion/help on how to do this with the Spark DataFrame API instead of a spark-sql query?
You can apply the exact same SQL in API form as:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val df = inputDF
  .withColumn("SumAmt", sum("SIAmt").over(
    Window
      .partitionBy(
        col("PNum"),
        col("ReasonCode"),
        when(col("ReasonCode") === "Yes", col("ProcessDate")).otherwise(null))
      .orderBy("ProcessDate")))
You can add the .rowsBetween(Long.MinValue, 0) part too (see the sketch after the output below), which should give you:
+----+----------+-----------+-----+------+
|Pnum|ReasonCode|ProcessDate|SIAmt|SumAmt|
+----+----------+-----------+-----+------+
| 1| Yes| 4/01/2016| 200| 200|
| 1| No| 1/01/2016| 200| 200|
| 1| No| 2/01/2016| 300| 500|
| 1| Yes| 3/01/2016| -200| -200|
+----+----------+-----------+-----+------+
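For reference, here is a minimal sketch of the same window with the frame spelled out explicitly (windowSpec and dfWithFrame are just illustrative names, assuming the same inputDF as above):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

// Same conditional partitioning as above, with the running-sum frame added
// explicitly via rowsBetween (Long.MinValue / 0 correspond to unbounded
// preceding / current row).
val windowSpec = Window
  .partitionBy(
    col("PNum"),
    col("ReasonCode"),
    when(col("ReasonCode") === "Yes", col("ProcessDate")).otherwise(null))
  .orderBy("ProcessDate")
  .rowsBetween(Long.MinValue, 0)

val dfWithFrame = inputDF.withColumn("SumAmt", sum("SIAmt").over(windowSpec))

dfWithFrame should produce the same output as shown above.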
I have two dataframes which look like below.
I am trying to find the difference between the two Amt values based on ID.
Dataframe 1:
ID I Amt
1 null 200
null 2 200
3 null 600
Dataframe 2:
ID I Amt
2 null 300
3 null 400
Output
Df
ID Amt(df2-df1)
2 100
3 -200
Query that doesn't work (the subtraction doesn't work):
df = df1.join(df2, df1["coalesce(ID, I)"] == df2["coalesce(ID, I)"], 'inner').select(
    (df1["amt"]) - (df2["amt"]), df1["coalesce(ID, I)"]).show()
I would do a couple of things differently. To make it easier to know what column is in what dataframe, I would rename them. I would also do the coalesce outside of the join itself.
val joined = df1.withColumn("joinKey",coalesce($"ID",$"I")).select($"joinKey",$"Amt".alias("DF1_AMT")).join(
df2.withColumn("joinKey",coalesce($"ID",$"I")).select($"joinKey",$"Amt".alias("DF2_AMT")),"joinKey")
Then you can easily perform your calculation:
joined.withColumn("DIFF",$"DF2_AMT" - $"DF1_AMT").show
+-------+-------+-------+------+
|joinKey|DF1_AMT|DF2_AMT| DIFF|
+-------+-------+-------+------+
| 2| 200| 300| 100.0|
| 3| 600| 400|-200.0|
+-------+-------+-------+------+
I am not sure if I am asking this correctly, and maybe that is why I haven't found the correct answer so far. Anyway, if it turns out to be a duplicate, I will delete this question.
I have following data:
id | last_updated | count
__________________________
1 | 20190101 | 3
1 | 20190201 | 2
1 | 20190301 | 1
I want to group this data by the "id" column, get the max value from "last_updated", and for the "count" column keep the value from the row where "last_updated" has its max value. So the result should look like this:
id | last_updated | count
__________________________
1 | 20190301 | 1
So I imagine it will look something like this:
df
.groupBy("id")
.agg(max("last_updated"), ... ("count"))
Is there any function I can use to get "count" based on the "last_updated" column?
I am using Spark 2.4.0.
Thanks for any help.
You have two options; the first is the better one, as far as I understand.
OPTION 1
Perform a window function over the ID and create a column with the max value over that window. Then select the rows where the desired column equals the max value, and finally drop the original column and rename the max column as desired.
val w = Window.partitionBy("id")
df.withColumn("max", max("last_updated").over(w))
.where("max = last_updated")
.drop("last_updated")
.withColumnRenamed("max", "last_updated")
OPTION 2
You can perform a join with the original dataframe after grouping
df.groupBy("id")
.agg(max("last_updated").as("last_updated"))
.join(df, Seq("id", "last_updated"))
QUICK EXAMPLE
INPUT
df.show
+---+------------+-----+
| id|last_updated|count|
+---+------------+-----+
| 1| 20190101| 3|
| 1| 20190201| 2|
| 1| 20190301| 1|
+---+------------+-----+
OUTPUT
Option 1
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val w = Window.partitionBy("id")
df.withColumn("max", max("last_updated").over(w))
.where("max = last_updated")
.drop("last_updated")
.withColumnRenamed("max", "last_updated")
+---+-----+------------+
| id|count|last_updated|
+---+-----+------------+
| 1| 1| 20190301|
+---+-----+------------+
Option 2
df.groupBy("id")
.agg(max("last_updated").as("last_updated")
.join(df, Seq("id", "last_updated")).show
+---+------------+-----+
| id|last_updated|count|
+---+------------+-----+
|  1|    20190301|    1|
+---+------------+-----+
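As an aside that is not part of the original answer, a third option is often possible with a single aggregation by taking the max of a struct, since structs compare field by field starting with the first one; a minimal sketch under that assumption:

import org.apache.spark.sql.functions.{max, struct}

// max over a struct orders rows by last_updated first, so the matching
// count value is carried along with the maximum.
df.groupBy("id")
  .agg(max(struct($"last_updated", $"count")).as("latest"))
  .select($"id", $"latest.last_updated", $"latest.count")
  .show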
I am trying to find the duplicate column values in a dataframe in PySpark.
For example, I have a dataframe with a single column 'A' with values like below:
==
A
==
1
1
2
3
4
5
5
I am expecting output like below (I only need the duplicate values):
==
A
==
1
5
Same answer as @Yuva, but using the built-in functions:
df = sqlContext.createDataFrame([(1,),(1,),(2,),(3,),(4,),(5,),(5,)],('A',))
df.groupBy("A").count().where("count > 1").drop("count").show()
+---+
| A|
+---+
| 5|
| 1|
+---+
Can you try this, and see if this helps?
df = sqlContext.createDataFrame([(1,),(1,),(2,),(3,),(4,),(5,),(5,)],('A',))
df.createOrReplaceTempView("df_tbl")
spark.sql("select A, count(*) as COUNT from df_tbl group by a having COUNT > 1").show()
+---+-----+
|  A|COUNT|
+---+-----+
|  5|    2|
|  1|    2|
+---+-----+
I have 2 Spark dataframes, and I want to add a new column named "seg" to dataframe df2 based on the below condition:
if the df2.colx value is present in df1.colx.
I tried the below operation in PySpark, but it's throwing an exception.
cc002 = df2.withColumn('seg',F.when(df2.colx == df1.colx,"True").otherwise("FALSE"))
df1 :
id colx coly
1 678 56789
2 900 67890
3 789 67854
df2
Name colx
seema 900
yash 678
deep 800
harsh 900
My expected Output is
Name colx seg
seema 900 True
harsh 900 True
yash 678 True
deep 800 False
Please help me correct the given PySpark code, or suggest a better way of doing it.
If I understand your question correctly, what you want to do is this:
res = df2.join(
df1,
on="colx",
how = "left"
).select(
"Name",
"colx"
).withColumn(
"seg",
F.when(F.col(colx).isNull(),F.lit(True)).otherwise(F.lit(False))
)
Let me know if this is the solution you want.
My bad, I wrote the incorrect code in a hurry; below is the corrected one:
import pyspark.sql.functions as F
df1 = sqlContext.createDataFrame([[1,678,56789],[2,900,67890],[3,789,67854]],['id', 'colx', 'coly'])
df2 = sqlContext.createDataFrame([["seema",900],["yash",678],["deep",800],["harsh",900]],['Name', 'colx'])
res = df2.join(
df1.withColumn(
"check",
F.lit(1)
),
on="colx",
how = "left"
).withColumn(
"seg",
F.when(F.col("check").isNotNull(),F.lit(True)).otherwise(F.lit(False))
).select(
"Name",
"colx",
"seg"
)
res.show()
+-----+----+-----+
| Name|colx| seg|
+-----+----+-----+
| yash| 678| true|
|seema| 900| true|
|harsh| 900| true|
| deep| 800|false|
+-----+----+-----+
You can join on colx and fill null values with False:
result = (df2.join(df1.select(df1['colx'], F.lit(True).alias('seg')),
on='colx',
how='left')
.fillna(False, subset='seg'))
result.show()
Output:
+----+-----+-----+
|colx| Name| seg|
+----+-----+-----+
| 900|seema| true|
| 900|harsh| true|
| 800| deep|false|
| 678| yash| true|
+----+-----+-----+
I have two dataframes in Scala:
df1 =
ID Field1
1 AAA
2 BBB
4 CCC
and
df2 =
PK start_date_time
1 2016-10-11 11:55:23
2 2016-10-12 12:25:00
3 2016-10-12 16:20:00
I also have a variable start_date with the format yyyy-MM-dd equal to 2016-10-11.
I need to create a new column check in df1 based on the following condition: If PK is equal to ID AND the year, month and day of start_date_time are equal to start_date, then check is equal to 1, otherwise 0.
The result should be this one:
df1 =
ID Field1 check
1 AAA 1
2 BBB 0
4 CCC 0
In my previous question I had two dataframes and it was suggested to use joining and filtering. However, in this case that won't work. My initial idea was to use a udf, but I am not sure how to make it work for this case.
You can combine join and withColumn for this case, i.e. first join with df2 on the ID column and then use when/otherwise syntax to modify the check column:
import org.apache.spark.sql.functions.{lit, to_date, when}

val df2_date = df2
  .withColumn("date", to_date(df2("start_date_time")))
  .withColumn("check", lit(1))
  .select($"PK".as("ID"), $"date", $"check")

df1.join(df2_date, Seq("ID"), "left")
  .withColumn("check", when($"date" === "2016-10-11", $"check").otherwise(0))
  .drop("date")
  .show
+---+------+-----+
| ID|Field1|check|
+---+------+-----+
| 1| AAA| 1|
| 2| BBB| 0|
| 4| CCC| 0|
+---+------+-----+
Or another option: first filter df2, and then join it back with df1 on the ID column:
val df2_date = (df2.withColumn("date", to_date(df2("start_date_time"))).
filter($"date" === "2016-10-11").
withColumn("check", lit(1)).
select($"PK".as("ID"), $"date", $"check"))
df1.join(df2_date, Seq("ID"), "left").drop("date").na.fill(0).show
+---+------+-----+
| ID|Field1|check|
+---+------+-----+
| 1| AAA| 1|
| 2| BBB| 0|
| 4| CCC| 0|
+---+------+-----+
In case you have a date like 2016-OCT-11, you can convert it to a java.sql.Date for comparison as follows:
val format = new java.text.SimpleDateFormat("yyyy-MMM-dd")
val parsed = format.parse("2016-OCT-11")
val date = new java.sql.Date(parsed.getTime())
// date: java.sql.Date = 2016-10-11
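If you then want to use that parsed value in the comparison above instead of the hard-coded "2016-10-11" string, here is a minimal sketch (assuming the df2_date from the first option, with its derived date column):

import org.apache.spark.sql.functions.{lit, when}

// Compare the derived "date" column against the parsed java.sql.Date
// rather than a string literal.
df1.join(df2_date, Seq("ID"), "left")
  .withColumn("check", when($"date" === lit(date), $"check").otherwise(0))
  .drop("date")
  .show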