I am looking for help joining two DataFrames with a conditional join on their time columns, using Spark Scala.
DF1:
time_1               revision  review_1
2022-04-05 08:32:00  1         abc
2022-04-05 10:15:00  2         abc
2022-04-05 12:15:00  3         abc
2022-04-05 09:00:00  1         xyz
2022-04-05 20:20:00  2         xyz
DF2:
time_2               review_1  value
2022-04-05 08:30:00  abc       value_1
2022-04-05 09:48:00  abc       value_2
2022-04-05 15:40:00  abc       value_3
2022-04-05 08:00:00  xyz       value_4
2022-04-05 09:00:00  xyz       value_5
2022-04-05 10:00:00  xyz       value_6
2022-04-05 11:00:00  xyz       value_7
2022-04-05 12:00:00  xyz       value_8
Desired Output DF:
time_1               revision  review_1  value
2022-04-05 08:32:00  1         abc       value_1
2022-04-05 10:15:00  2         abc       value_2
2022-04-05 12:15:00  3         abc       null
2022-04-05 09:00:00  1         xyz       value_6
2022-04-05 20:20:00  2         xyz       null
If multiple rows match during the join, only the latest one (by time) should be taken, as in row 4 of the final output (where time_1 = 2022-04-05 09:00:00).
Furthermore, if a row of DF1 has no match in the join, its value column should be null.
The join uses two columns from each DataFrame:
review_1 === review_2 &&
time_1 within +/- 1 hour of time_2 (if several records qualify, show the latest value, as with value_6 above)
Here is the code necessary to join the DataFrames:
I have commented the code so as to explain the logic.
TL;DR
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}
val SECONDS_IN_ONE_HOUR = 60 * 60
// one window per `df1` row (review + time), with the latest `time_2` first
val window = Window.partitionBy("review_1", "time_1").orderBy(col("time_2").desc)
df1WithEpoch
  .join(df2WithEpoch,
    df1WithEpoch("review_1") === df2WithEpoch("review_2")
    && (
      // lower bound: `time_1` is at most 1 hour earlier than `time_2`
      (df1WithEpoch("epoch_time_1") >= df2WithEpoch("epoch_time_2") - SECONDS_IN_ONE_HOUR)
      // upper bound: `time_1` is at most 1 hour later than `time_2`
      && (df1WithEpoch("epoch_time_1") <= df2WithEpoch("epoch_time_2") + SECONDS_IN_ONE_HOUR)
    ),
    // LEFT OUTER is necessary to get back `2022-04-05 12:15:00` and `2022-04-05 20:20:00` which have no join to `df2` in the time window
    "left_outer"
  )
  // keep only the match with the latest `time_2` for each `df1` row
  .withColumn("row_num", row_number().over(window))
  .filter(col("row_num") === 1)
  // select only the columns we care about
  .select("time_1", "revision", "review_1", "value")
  // order by to give the results in the same order as in the Question
  .orderBy(col("review_1"), col("revision"))
  .show(false)
Full breakdown
Let's start off with your DataFrames: df1 and df2 in code:
val df1 = List(
("2022-04-05 08:32:00", 1, "abc"),
("2022-04-05 10:15:00", 2, "abc"),
("2022-04-05 12:15:00", 3, "abc"),
("2022-04-05 09:00:00", 1, "xyz"),
("2022-04-05 20:20:00", 2, "xyz")
).toDF("time_1", "revision", "review_1")
df1.show(false)
gives:
+-------------------+--------+--------+
|time_1 |revision|review_1|
+-------------------+--------+--------+
|2022-04-05 08:32:00|1 |abc |
|2022-04-05 10:15:00|2 |abc |
|2022-04-05 12:15:00|3 |abc |
|2022-04-05 09:00:00|1 |xyz |
|2022-04-05 20:20:00|2 |xyz |
+-------------------+--------+--------+
val df2 = List(
("2022-04-05 08:30:00", "abc", "value_1"),
("2022-04-05 09:48:00", "abc", "value_2"),
("2022-04-05 15:40:00", "abc", "value_3"),
("2022-04-05 08:00:00", "xyz", "value_4"),
("2022-04-05 09:00:00", "xyz", "value_5"),
("2022-04-05 10:00:00", "xyz", "value_6"),
("2022-04-05 11:00:00", "xyz", "value_7"),
("2022-04-05 12:00:00", "xyz", "value_8")
).toDF("time_2", "review_2", "value")
df2.show(false)
gives:
+-------------------+--------+-------+
|time_2 |review_2|value |
+-------------------+--------+-------+
|2022-04-05 08:30:00|abc |value_1|
|2022-04-05 09:48:00|abc |value_2|
|2022-04-05 15:40:00|abc |value_3|
|2022-04-05 08:00:00|xyz |value_4|
|2022-04-05 09:00:00|xyz |value_5|
|2022-04-05 10:00:00|xyz |value_6|
|2022-04-05 11:00:00|xyz |value_7|
|2022-04-05 12:00:00|xyz |value_8|
+-------------------+--------+-------+
Next we need new columns on which to do the time range check, with each time represented as a single number so that the arithmetic is easy:
// add a new column, temporarily, which contains the time in
// epoch format: with this adding/subtracting an hour can easily be done.
val df1WithEpoch = df1.withColumn("epoch_time_1", unix_timestamp(col("time_1")))
val df2WithEpoch = df2.withColumn("epoch_time_2", unix_timestamp(col("time_2")))
df1WithEpoch.show()
df2WithEpoch.show()
gives:
+-------------------+--------+--------+------------+
| time_1|revision|review_1|epoch_time_1|
+-------------------+--------+--------+------------+
|2022-04-05 08:32:00| 1| abc| 1649147520|
|2022-04-05 10:15:00| 2| abc| 1649153700|
|2022-04-05 12:15:00| 3| abc| 1649160900|
|2022-04-05 09:00:00| 1| xyz| 1649149200|
|2022-04-05 20:20:00| 2| xyz| 1649190000|
+-------------------+--------+--------+------------+
+-------------------+--------+-------+------------+
| time_2|review_2| value|epoch_time_2|
+-------------------+--------+-------+------------+
|2022-04-05 08:30:00| abc|value_1| 1649147400|
|2022-04-05 09:48:00| abc|value_2| 1649152080|
|2022-04-05 15:40:00| abc|value_3| 1649173200|
|2022-04-05 08:00:00| xyz|value_4| 1649145600|
|2022-04-05 09:00:00| xyz|value_5| 1649149200|
|2022-04-05 10:00:00| xyz|value_6| 1649152800|
|2022-04-05 11:00:00| xyz|value_7| 1649156400|
|2022-04-05 12:00:00| xyz|value_8| 1649160000|
+-------------------+--------+-------+------------+
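As a sanity check on the epoch values above, here is a minimal plain-Python sketch (no Spark needed) that converts the same strings to seconds since the epoch. It assumes the strings are interpreted as UTC, which is what the numbers in the tables correspond to; Spark's unix_timestamp uses the session time zone, so a different setting would shift every value. The helper name to_epoch_utc is made up for this illustration.

```python
from datetime import datetime, timezone

SECONDS_IN_ONE_HOUR = 60 * 60


def to_epoch_utc(ts: str) -> int:
    """Parse 'yyyy-MM-dd HH:mm:ss' and return epoch seconds, treating the value as UTC (an assumption)."""
    parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


t1 = to_epoch_utc("2022-04-05 09:00:00")  # df1 row for review "xyz"
t2 = to_epoch_utc("2022-04-05 10:00:00")  # df2 row carrying value_6

print(t1)                                   # 1649149200
print(abs(t1 - t2) <= SECONDS_IN_ONE_HOUR)  # True: exactly one hour apart still counts
```

Once both times are plain integers, the "+1/-1 hour" condition is a single comparison against 3600 seconds.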
and finally to join:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}
val SECONDS_IN_ONE_HOUR = 60 * 60
// one window per `df1` row (review + time), with the latest `time_2` first
val window = Window.partitionBy("review_1", "time_1").orderBy(col("time_2").desc)
df1WithEpoch
  .join(df2WithEpoch,
    df1WithEpoch("review_1") === df2WithEpoch("review_2")
    && (
      // lower bound: `time_1` is at most 1 hour earlier than `time_2`
      (df1WithEpoch("epoch_time_1") >= df2WithEpoch("epoch_time_2") - SECONDS_IN_ONE_HOUR)
      // upper bound: `time_1` is at most 1 hour later than `time_2`
      && (df1WithEpoch("epoch_time_1") <= df2WithEpoch("epoch_time_2") + SECONDS_IN_ONE_HOUR)
    ),
    // LEFT OUTER is necessary to get back `2022-04-05 12:15:00` and `2022-04-05 20:20:00` which have no join to `df2` in the time window
    "left_outer"
  )
  // keep only the match with the latest `time_2` for each `df1` row
  .withColumn("row_num", row_number().over(window))
  .filter(col("row_num") === 1)
  // select only the columns we care about
  .select("time_1", "revision", "review_1", "value")
  // order by to give the results in the same order as in the Question
  .orderBy(col("review_1"), col("revision"))
  .show(false)
gives:
+-------------------+--------+--------+-------+
|time_1 |revision|review_1|value |
+-------------------+--------+--------+-------+
|2022-04-05 08:32:00|1 |abc |value_1|
|2022-04-05 10:15:00|2 |abc |value_2|
|2022-04-05 12:15:00|3 |abc |null |
|2022-04-05 09:00:00|1 |xyz |value_6|
|2022-04-05 20:20:00|2 |xyz |null |
+-------------------+--------+--------+-------+
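The selection rule itself (same review, within one hour either way, latest time_2 wins, otherwise null) can be checked by hand without Spark. The following is a plain-Python sketch of that rule on the question's data, not a replacement for the join above; the function name latest_value_within_an_hour is invented for the example.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"
ONE_HOUR = timedelta(hours=1)

df1 = [
    ("2022-04-05 08:32:00", 1, "abc"),
    ("2022-04-05 10:15:00", 2, "abc"),
    ("2022-04-05 12:15:00", 3, "abc"),
    ("2022-04-05 09:00:00", 1, "xyz"),
    ("2022-04-05 20:20:00", 2, "xyz"),
]
df2 = [
    ("2022-04-05 08:30:00", "abc", "value_1"),
    ("2022-04-05 09:48:00", "abc", "value_2"),
    ("2022-04-05 15:40:00", "abc", "value_3"),
    ("2022-04-05 08:00:00", "xyz", "value_4"),
    ("2022-04-05 09:00:00", "xyz", "value_5"),
    ("2022-04-05 10:00:00", "xyz", "value_6"),
    ("2022-04-05 11:00:00", "xyz", "value_7"),
    ("2022-04-05 12:00:00", "xyz", "value_8"),
]


def latest_value_within_an_hour(time_1, review):
    """Return the value of the latest df2 row for this review within +/- 1 hour, or None."""
    t1 = datetime.strptime(time_1, FMT)
    candidates = []
    for time_2, review_2, value in df2:
        t2 = datetime.strptime(time_2, FMT)
        if review_2 == review and abs(t1 - t2) <= ONE_HOUR:
            candidates.append((t2, value))
    # max() compares the (time_2, value) tuples by time_2 first, so the latest match wins
    return max(candidates)[1] if candidates else None


result = [(t, rev, review, latest_value_within_an_hour(t, review)) for t, rev, review in df1]
for row in result:
    print(row)
```

The None entries correspond to the null rows that the left outer join preserves.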
Related
I have to divide a set of columns in a pyspark.sql.dataframe by their respective column averages, but I cannot find a correct way to do it. Below are sample data and my current code.
Input Data
columns = ["Col1", "Col2", "Col3","Name"]
data = [("1","40", "56" , "john jones"),
("2", "45", "55", "tracey smith"),
("3", "33", "23", "amy sanders")]
df = spark.createDataFrame(data=data,schema=columns)
Col1 Col2 Col3 Name
1 40 56 john jones
2 45 55 tracey smith
3 33 23 amy sanders
Expected Output
Col1 Col2 Col3 Name
0.5 1.02 1.25 john jones
1 1.14 1.23 tracey smith
1.5 0.84 0.51 amy sanders
Function as of now (not working):
# function to divide a few columns by the column average and overwrite the column
def avg_scaling(df):
    # list of columns which have to be scaled by their average
    col_list = ['col1', 'col2', 'col3']
    for i in col_list:
        df = df.withcolumn(i, col(i)/df.select(f.avg(df[i])))
    return df
new_df = avg_scaling(df)
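You can use a Window partitioned on a pseudo column (a constant, so every row falls into the same window) and compute the average over that window. The attempt above fails because the method is withColumn (capital C) and because df.select(...) returns a DataFrame, which cannot be used as the divisor of a column.
The code goes like this: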
columns = ["Col1", "Col2", "Col3","Name"]
data = [("1","40", "56" , "john jones"),
("2", "45", "55", "tracey smith"),
("3", "33", "23", "amy sanders")]
df = spark.createDataFrame(data=data,schema=columns)
df.show()
+----+----+----+------------+
|Col1|Col2|Col3| Name|
+----+----+----+------------+
| 1| 40| 56| john jones|
| 2| 45| 55|tracey smith|
| 3| 33| 23| amy sanders|
+----+----+----+------------+
from pyspark.sql import Window
from pyspark.sql import functions as F

def avg_scaling(df, cols_to_scale):
    # a constant partition key puts every row into a single window
    w = Window.partitionBy(F.lit(1))
    for col in cols_to_scale:
        df = df.withColumn(col, F.round(F.col(col) / F.avg(col).over(w), 2))
    return df

new_df = avg_scaling(df, ["Col1", "Col2", "Col3"])
new_df.show()
+----+----+----+------------+
|Col1|Col2|Col3| Name|
+----+----+----+------------+
| 0.5|1.02|1.25| john jones|
| 1.5|0.84|0.51| amy sanders|
| 1.0|1.14|1.23|tracey smith|
+----+----+----+------------+
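To verify the numbers in that output by hand, here is a small plain-Python sketch of the same arithmetic (divide each value by its column mean and round to two decimals). It only mirrors the calculation and does not use Spark; the dictionary layout is an assumption made for the illustration.

```python
from statistics import mean

rows = [
    {"Col1": "1", "Col2": "40", "Col3": "56", "Name": "john jones"},
    {"Col1": "2", "Col2": "45", "Col3": "55", "Name": "tracey smith"},
    {"Col1": "3", "Col2": "33", "Col3": "23", "Name": "amy sanders"},
]


def avg_scaling(rows, cols_to_scale):
    """Return new rows where each listed column is divided by that column's mean."""
    # the sample data stores numbers as strings, so convert before averaging
    averages = {c: mean(float(r[c]) for r in rows) for c in cols_to_scale}
    return [
        {**r, **{c: round(float(r[c]) / averages[c], 2) for c in cols_to_scale}}
        for r in rows
    ]


scaled = avg_scaling(rows, ["Col1", "Col2", "Col3"])
for r in scaled:
    print(r["Col1"], r["Col2"], r["Col3"], r["Name"])
```

Each column mean is computed once over all rows, which is the role the single-partition window plays in the PySpark version.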
I have the following dataframe:
ID Name City
1 Ali swl
2 Sana lhr
3 Ahad khi
4 ABC fsd
And a list of values like (1,2,1):
val nums: List[Int] = List(1, 2, 1)
I want to add these values as new columns for the row with ID == 3, so that the DataFrame looks like:
ID Name City newCol newCol2 newCol3
1 Ali swl null null null
2 Sana lhr null null null
3 Ahad khi 1 2 1
4 ABC fsd null null null
I wonder if it is possible? Any help will be appreciated. Thanks
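Yes, it's possible.
Use when to populate the rows that match and otherwise for the rows that do not.
I have used zipWithIndex to make the column names unique.
Please check the code below. Note that filterData is List(3, 4) here, so both ID 3 and ID 4 are populated; use List(3) to match the question exactly.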
scala> import org.apache.spark.sql.functions._
scala> val df = Seq((1,"Ali","swl"),(2,"Sana","lhr"),(3,"Ahad","khi"),(4,"ABC","fsd")).toDF("id","name","city") // Creating DataFrame with given sample data.
df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
scala> val nums = List(1,2,1) // List values.
nums: List[Int] = List(1, 2, 1)
scala> val filterData = List(3,4)
scala> spark.time{ nums.zipWithIndex.foldLeft(df)((df,c) => df.withColumn(s"newCol${c._2}",when($"id".isin(filterData:_*),c._1).otherwise(null))).show(false) } // Used zipWithIndex to make column names unique.
+---+----+----+-------+-------+-------+
|id |name|city|newCol0|newCol1|newCol2|
+---+----+----+-------+-------+-------+
|1 |Ali |swl |null |null |null |
|2 |Sana|lhr |null |null |null |
|3 |Ahad|khi |1 |2 |1 |
|4 |ABC |fsd |1 |2 |1 |
+---+----+----+-------+-------+-------+
Time taken: 43 ms
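The idea behind the fold (one new column per list element, filled only on matching rows) can be sketched in plain Python as follows. This is an illustration of the logic rather than Spark code; add_list_columns and target_ids are names chosen for the example, and target_ids holds only 3 to match the question.

```python
rows = [
    {"id": 1, "name": "Ali", "city": "swl"},
    {"id": 2, "name": "Sana", "city": "lhr"},
    {"id": 3, "name": "Ahad", "city": "khi"},
    {"id": 4, "name": "ABC", "city": "fsd"},
]
nums = [1, 2, 1]
target_ids = {3}  # the question only asks for ID == 3


def add_list_columns(rows, nums, target_ids):
    """Add newCol0..newColN to every row; only rows whose id is in target_ids get the values."""
    out = []
    for row in rows:
        # enumerate plays the role of zipWithIndex: the index keeps the column names unique
        extra = {
            f"newCol{i}": (n if row["id"] in target_ids else None)
            for i, n in enumerate(nums)
        }
        out.append({**row, **extra})
    return out


result = add_list_columns(rows, nums, target_ids)
for r in result:
    print(r)
```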
Alternatively, first convert the list to a DataFrame with a single array column, and then "decompose" the array column into separate columns as follows:
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._
val numsDf =
Seq(nums)
.toDF("nums")
.select(nums.indices.map(i => col("nums")(i).alias(s"newCol$i")): _*)
After that, use an outer join to attach numsDf to the original DataFrame (called data here) with the ID == 3 condition:
val resultDf = data.join(numsDf, data.col("ID") === lit(3), "outer")
resultDf.show() will print:
+---+----+----+-------+-------+-------+
| ID|Name|City|newCol0|newCol1|newCol2|
+---+----+----+-------+-------+-------+
| 1| Ali| swl| null| null| null|
| 2|Sana| lhr| null| null| null|
| 3|Ahad| khi| 1| 2| 1|
| 4| ABC| fsd| null| null| null|
+---+----+----+-------+-------+-------+
Make sure the spark.sql.crossJoin.enabled = true option is set on the Spark session:
val spark = SparkSession.builder()
...
.config("spark.sql.crossJoin.enabled", value = true)
.getOrCreate()
I'm new to SparkSQL, and I want to calculate the percentage of each status in my data.
Here is my data:
A B
11 1
11 3
12 1
13 3
12 2
13 1
11 1
12 2
So, I can do it in SQL like this:
select (C.oneTotal / C.total) as onePercentage,
(C.twoTotal / C.total) as twotPercentage,
(C.threeTotal / C.total) as threPercentage
from (select count(*) as total,
A,
sum(case when B = '1' then 1 else 0 end) as oneTotal,
sum(case when B = '2' then 1 else 0 end) as twoTotal,
sum(case when B = '3' then 1 else 0 end) as threeTotal
from test
group by A) as C;
But with the SparkSQL DataFrame API, I first calculate the total count for every status like below:
// wrong code
val cc = transDF.select("transData.*").groupBy("A")
.agg(count("transData.*").alias("total"),
sum(when(col("B") === "1", 1)).otherwise(0)).alias("oneTotal")
sum(when(col("B") === "2", 1).otherwise(0)).alias("twoTotal")
The problem is that the result of sum(when) is zero.
Am I using it wrong?
How can I implement this in SparkSQL like my SQL above, and then calculate the proportion of every status?
Thank you for your help. In the end, I solved it with sum(when). Below is my current code.
val cc = transDF.select("transData.*").groupBy("A")
  .agg(count("transData.*").alias("total"),
    sum(when(col("B") === "1", 1).otherwise(0)).alias("oneTotal"),
    sum(when(col("B") === "2", 1).otherwise(0)).alias("twoTotal"))
  .select(col("total"),
    col("A"),
    // parenthesize the division so that the alias applies to the ratio, not just to col("total")
    (col("oneTotal") / col("total")).alias("oneRate"),
    (col("twoTotal") / col("total")).alias("twoRate"))
Thanks again.
You can use sum(when(...)) or count(when(...)). The second option is shorter to write, because when without otherwise yields null for non-matching rows and count skips nulls:
val df = Seq(
(11, 1),
(11, 3),
(12, 1),
(13, 3),
(12, 2),
(13, 1),
(11, 1),
(12, 2)
).toDF("A", "B")
df
.groupBy($"A")
.agg(
count("*").as("total"),
count(when($"B"==="1",$"A")).as("oneTotal"),
count(when($"B"==="2",$"A")).as("twoTotal"),
count(when($"B"==="3",$"A")).as("threeTotal")
)
.select(
$"A",
($"oneTotal"/$"total").as("onePercentage"),
($"twoTotal"/$"total").as("twoPercentage"),
($"threeTotal"/$"total").as("threePercentage")
)
.show()
gives
+---+------------------+------------------+------------------+
| A| onePercentage| twoPercentage| threePercentage|
+---+------------------+------------------+------------------+
| 12|0.3333333333333333|0.6666666666666666| 0.0|
| 13| 0.5| 0.0| 0.5|
| 11|0.6666666666666666| 0.0|0.3333333333333333|
+---+------------------+------------------+------------------+
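For readers who want to check those ratios without a Spark session, here is a plain-Python sketch of the same aggregation on the sample data. The comment marks which part mirrors sum(when(...).otherwise(0)); status_shares is a name invented for the example.

```python
from collections import defaultdict

data = [(11, 1), (11, 3), (12, 1), (13, 3), (12, 2), (13, 1), (11, 1), (12, 2)]


def status_shares(pairs, statuses=(1, 2, 3)):
    """For each A, return the share of rows having each status B, in the order of `statuses`."""
    groups = defaultdict(list)
    for a, b in pairs:
        groups[a].append(b)
    shares = {}
    for a, bs in groups.items():
        total = len(bs)  # count(*)
        # summing 1/0 flags mirrors sum(when(B == s, 1).otherwise(0))
        shares[a] = tuple(sum(1 if b == s else 0 for b in bs) / total for s in statuses)
    return shares


shares = status_shares(data)
for a in sorted(shares):
    print(a, shares[a])
```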
Alternatively, you could produce a "long" table with window functions (this needs import org.apache.spark.sql.expressions.Window):
df
.groupBy($"A",$"B").count()
.withColumn("total",sum($"count").over(Window.partitionBy($"A")))
.select(
$"A",
$"B",
($"count"/$"total").as("percentage")
).orderBy($"A",$"B")
.show()
+---+---+------------------+
| A| B| percentage|
+---+---+------------------+
| 11| 1|0.6666666666666666|
| 11| 3|0.3333333333333333|
| 12| 1|0.3333333333333333|
| 12| 2|0.6666666666666666|
| 13| 1| 0.5|
| 13| 3| 0.5|
+---+---+------------------+
As far as I understand, you want to implement the logic of the SQL shown in the question.
One way is to register the DataFrame as a temporary view and run that SQL directly, as in the example below:
package examples
import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object AggTest extends App {
val logger = org.apache.log4j.Logger.getLogger("org")
logger.setLevel(Level.WARN)
val spark = SparkSession.builder.appName(getClass.getName)
.master("local[*]").getOrCreate
import spark.implicits._
val df = Seq(
(11, 1),
(11, 3),
(12, 1),
(13, 3),
(12, 2),
(13, 1),
(11, 1),
(12, 2)
).toDF("A", "B")
df.show(false)
df.createOrReplaceTempView("test")
spark.sql(
"""
|select (C.oneTotal / C.total) as onePercentage,
| (C.twoTotal / C.total) as twotPercentage,
| (C.threeTotal / C.total) as threPercentage
|from (select count(*) as total,
| A,
| sum(case when B = '1' then 1 else 0 end) as oneTotal,
| sum(case when B = '2' then 1 else 0 end) as twoTotal,
| sum(case when B = '3' then 1 else 0 end) as threeTotal
| from test
| group by A) as C
""".stripMargin).show
}
Result :
+---+---+
|A |B |
+---+---+
|11 |1 |
|11 |3 |
|12 |1 |
|13 |3 |
|12 |2 |
|13 |1 |
|11 |1 |
|12 |2 |
+---+---+
+------------------+------------------+------------------+
| onePercentage| twotPercentage| threPercentage|
+------------------+------------------+------------------+
|0.3333333333333333|0.6666666666666666| 0.0|
| 0.5| 0.0| 0.5|
|0.6666666666666666| 0.0|0.3333333333333333|
+------------------+------------------+------------------+
I need to write a row whenever the "AMT" column changes within a particular "KEY" group.
E.g.:
Scenario 1: For KEY=2, the first change is from 90 to 20, so write a record with the value (20 - 90).
Similarly, the next change for the same key group is from 20 to 30.5, so write another record with the value (30.5 - 20).
Scenario 2: For KEY=1, there is only one record in the group, so write it as is.
Scenario 3: For KEY=3, the same AMT value appears twice, so write it once.
How can this be implemented? With window functions, or with groupBy and agg functions?
Sample Input Data :
val DF1 = List((1,34.6),(2,90.0),(2,90.0),(2,20.0),(2,30.5),(3,89.0),(3,89.0)).toDF("KEY", "AMT")
DF1.show(false)
+-----+-------------------+
|KEY |AMT |
+-----+-------------------+
|1 |34.6 |
|2 |90.0 |
|2 |90.0 |
|2 |20.0 |----->[ 20.0 - 90.0 = -70.0 ]
|2 |30.5 |----->[ 30.5 - 20.0 = 10.5 ]
|3 |89.0 |
|3 |89.0 |
+-----+-------------------+
Expected Values :
scala> df2.show()
+----+--------------------+
|KEY | AMT |
+----+--------------------+
| 1 | 34.6 |-----> As Is
| 2 | -70.0 |----->[ 20.0 - 90.0 = -70.0 ]
| 2 | 10.5 |----->[ 30.5 - 20.0 = 10.5 ]
| 3 | 89.0 |-----> As Is, with one record only
+----+--------------------+
I have solved it in PySpark rather than Scala.
from pyspark.sql.functions import lag
from pyspark.sql.window import Window
# note: ordering by the partition key leaves the row order inside each KEY group undefined
w1=Window().partitionBy("key").orderBy("key")
DF4 =spark.createDataFrame([(1,34.6),(2,90.0),(2,90.0),(2,20.0),(2,30.5),(3,89.0),(3,89.0)],["KEY", "AMT"])
DF4.createOrReplaceTempView('keyamt')
# DF7: keys with a single distinct AMT, written once as is
DF7=spark.sql('select distinct key,amt from keyamt where key in ( select key from (select key,count(distinct(amt))dist from keyamt group by key) where dist=1)')
# DF8: the remaining keys, with the previous AMT next to each row
DF8=DF4.join(DF7,DF4['KEY']==DF7['KEY'],'leftanti').withColumn('new_col',((lag('AMT',1).over(w1)).cast('double') ))
# new_col1: change relative to the previous row (null on the first row of a group)
DF9=DF8.withColumn('new_col1', ((DF8['AMT']-DF8['new_col'].cast('double'))))
# keep only real changes (null and 0 are both filtered out), then add back the single-value keys
DF9.filter(DF9['new_col1'] !=0).select(DF9['KEY'],DF9['new_col1']).union(DF7).orderBy(DF9['KEY']).show()
Output:
+---+--------+
|KEY|new_col1|
+---+--------+
| 1| 34.6|
| 2| -70.0|
| 2| 10.5|
| 3| 89.0|
+---+--------+
You can implement your logic using a window function combined with when, lead, and monotonically_increasing_id() for ordering, as below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("KEY").orderBy("rowNo")
val tempdf = DF1.withColumn("rowNo", monotonically_increasing_id())
val nextAmt = lead("AMT", 1).over(windowSpec) // AMT of the next row in the same KEY group
tempdf.select($"KEY",
  when(nextAmt.isNull || (nextAmt - $"AMT") === lit(0.0), $"AMT")
    .otherwise(nextAmt - $"AMT").as("AMT")
).show(false)
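To pin down the three scenarios from the question, here is a plain-Python sketch of one reading of the rules: collapse consecutive repeats of AMT within a KEY, write the single value as is when nothing changes, and otherwise write the difference for each change. It assumes the rows arrive grouped by KEY and in a meaningful order, and it is only a reference for the expected values, not Spark code; amount_changes is a name invented here.

```python
from itertools import groupby

rows = [(1, 34.6), (2, 90.0), (2, 90.0), (2, 20.0), (2, 30.5), (3, 89.0), (3, 89.0)]


def amount_changes(rows):
    """Return (KEY, AMT) records: the value as is for unchanged keys, otherwise one diff per change."""
    out = []
    # groupby only merges adjacent rows, hence the assumption that rows are grouped by KEY
    for key, group in groupby(rows, key=lambda r: r[0]):
        # collapse consecutive repeats of the same AMT (90.0, 90.0 becomes 90.0)
        amounts = [amt for amt, _ in groupby(r[1] for r in group)]
        if len(amounts) == 1:
            out.append((key, amounts[0]))  # scenarios 2 and 3: write once, as is
        else:
            # scenario 1: one record per change, current minus previous
            out.extend((key, cur - prev) for prev, cur in zip(amounts, amounts[1:]))
    return out


changes = amount_changes(rows)
for record in changes:
    print(record)
```

On the sample input this yields the four records listed under "Expected Values" in the question.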
Is there a good way to solve this problem: suppose I want to select records that are at least 6 months before the previously selected record within a given grouping.
I.e., I have:
Col A Col B Date
1 A 2015-01-01 00:00:00
1 A 2014-10-01 00:00:00
1 A 2014-05-01 00:00:00
1 A 2014-01-01 00:00:00
1 B 2014-01-01 00:00:00
2 A 2015-01-01 00:00:00
2 A 2014-10-01 00:00:00
2 A 2014-01-01 00:00:00
2 A 2013-10-01 00:00:00
I'd like to select only dates that are at least 6 months apart relative to the previously selected one. I.e., it will return:
Col A Col B Date
1 A 2015-01-01 00:00:00
1 A 2014-05-01 00:00:00
1 B 2014-01-01 00:00:00
2 A 2015-01-01 00:00:00
2 A 2014-01-01 00:00:00
It is obvious to me how to do this using orderings if you wanted to select relative to the latest ones
(ie:
SELECT b.date, b..., a.latest_date
FROM(
SELECT *, row_number OVER PARTITION BY Col A, Col B ORDER BY Date as row_number
FROM table1) temp
WHERE row_number = 1) a
INNER JOIN TABLE 1 b
ON KEY)
WHERE datediff(date, latestdate)/365 > 0.5
or so
but I'm a little unclear how you would do this relative to each previously selected row. Is there a way to do this recursively in Hive / Scala or something?
Hi, both Hive and Spark support the lag and lead window functions, so you can do this in either. Here is the code in Spark:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val data = sc.parallelize(List(("1", "A", "2015-01-01 00:00:00"),
  ("1", "A", "2014-10-01 00:00:00"),
  ("1", "A", "2014-01-01 00:00:00"),
  ("1", "B", "2014-01-01 00:00:00"),
  ("2", "A", "2015-01-01 00:00:00"),
  ("2", "A", "2014-10-01 00:00:00"),
  ("2", "A", "2014-01-01 00:00:00"),
  ("2", "A", "2013-10-01 00:00:00")
)).toDF("id", "status", "Date")
val data2 = data.select($"id", $"status", to_date($"Date").alias("date"))
// newest date first within each (id, status) group
val wSpec3 = Window.partitionBy("id", "status").orderBy(desc("date"))
// diff = days between the previous (newer) row and this row; keep the first row and gaps over half a year
val data3 = data2.withColumn("diff", datediff(lag(data2("date"), 1).over(wSpec3), $"date")).filter($"diff" > 182.5 || $"diff".isNull)
data3.show
+---+------+----------+----+
| id|status| date|diff|
+---+------+----------+----+
| 1| A|2015-01-01|null|
| 1| A|2014-01-01| 273|
| 1| B|2014-01-01|null|
| 2| A|2015-01-01|null|
| 2| A|2014-01-01| 273|
+---+------+----------+----+
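Note that lag compares each row with the row immediately before it, whereas the question asks for a comparison with the previously selected row; the two differ when several close dates sit between selections (for example the 2014-05-01 row in the question's data, which is left out of the sample above). A plain-Python sketch of the "relative to the last kept row" rule follows. It treats 6 months as 182.5 days, the threshold used in the Spark code above, and select_spaced is a name invented for the example.

```python
from datetime import datetime
from itertools import groupby

FMT = "%Y-%m-%d %H:%M:%S"
MIN_GAP_DAYS = 182.5  # "6 months", using the same threshold as the Spark code

rows = [
    ("1", "A", "2015-01-01 00:00:00"),
    ("1", "A", "2014-10-01 00:00:00"),
    ("1", "A", "2014-05-01 00:00:00"),
    ("1", "A", "2014-01-01 00:00:00"),
    ("1", "B", "2014-01-01 00:00:00"),
    ("2", "A", "2015-01-01 00:00:00"),
    ("2", "A", "2014-10-01 00:00:00"),
    ("2", "A", "2014-01-01 00:00:00"),
    ("2", "A", "2013-10-01 00:00:00"),
]


def select_spaced(rows):
    """Per (Col A, Col B) group, walk newest to oldest and keep rows far enough from the last kept one."""
    group_key = lambda r: (r[0], r[1])
    selected = []
    for _, group in groupby(sorted(rows, key=group_key), key=group_key):
        last_kept = None
        # ISO-style timestamps sort correctly as strings, so reverse order is newest first
        for a, b, d in sorted(group, key=lambda r: r[2], reverse=True):
            day = datetime.strptime(d, FMT)
            if last_kept is None or (last_kept - day).days > MIN_GAP_DAYS:
                selected.append((a, b, d))
                last_kept = day  # later rows are measured against this one
    return selected


selected = select_spaced(rows)
for r in selected:
    print(r)
```

Because each decision depends on the last kept row, this is a sequential scan within each group rather than a single lag comparison.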