Is there a good way to solve this problem: suppose I want to select records that are at least 6 months prior to the previously selected record within a given grouping.
I.e. I have:
Col A  Col B  Date
1      A      2015-01-01 00:00:00
1      A      2014-10-01 00:00:00
1      A      2014-05-01 00:00:00
1      A      2014-01-01 00:00:00
1      B      2014-01-01 00:00:00
2      A      2015-01-01 00:00:00
2      A      2014-10-01 00:00:00
2      A      2014-01-01 00:00:00
2      A      2013-10-01 00:00:00
I'd like to select only dates that are at least 6 months apart from the previously selected one. I.e. it should return:
Col A  Col B  Date
1      A      2015-01-01 00:00:00
1      A      2014-05-01 00:00:00
1      B      2014-01-01 00:00:00
2      A      2015-01-01 00:00:00
2      A      2014-01-01 00:00:00
It is obvious to me how to do this with orderings if you wanted to select relative to the latest record per group, e.g. something like:

SELECT b.date, b..., a.latest_date
FROM (
  SELECT *, date AS latest_date
  FROM (
    SELECT *, row_number() OVER (PARTITION BY col_a, col_b ORDER BY date DESC) AS row_number
    FROM table1) temp
  WHERE row_number = 1) a
INNER JOIN table1 b
  ON KEY
WHERE datediff(b.date, a.latest_date)/365 > 0.5

or so, but I'm unclear how you would do this relative to each previously selected record. Is there a way to do this recursively in Hive / Scala or something?
Hive and Spark both support the lag and lead windowing functions, so you can achieve this task in either. Here is the code in Spark.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val data = sc.parallelize(List(("1", "A", "2015-01-01 00:00:00"),
                               ("1", "A", "2014-10-01 00:00:00"),
                               ("1", "A", "2014-01-01 00:00:00"),
                               ("1", "B", "2014-01-01 00:00:00"),
                               ("2", "A", "2015-01-01 00:00:00"),
                               ("2", "A", "2014-10-01 00:00:00"),
                               ("2", "A", "2014-01-01 00:00:00"),
                               ("2", "A", "2013-10-01 00:00:00"))).toDF("id", "status", "Date")

// parse the timestamp strings into a date column
val data2 = data.select($"id", $"status", to_date($"Date").alias("date"))

// latest date first within each (id, status) group
val wSpec3 = Window.partitionBy("id", "status").orderBy(desc("date"))

// diff = days between a row and the next-later row in its group;
// keep rows that are more than ~6 months apart (or the first row of each group, where diff is null)
val data3 = data2
  .withColumn("diff", datediff(lag(data2("date"), 1).over(wSpec3), $"date"))
  .filter($"diff" > 182.5 || $"diff".isNull)
data3.show
+---+------+----------+----+
| id|status| date|diff|
+---+------+----------+----+
| 1| A|2015-01-01|null|
| 1| A|2014-01-01| 273|
| 1| B|2014-01-01|null|
| 2| A|2015-01-01|null|
| 2| A|2014-01-01| 273|
+---+------+----------+----+
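Since the same windowing functions exist in Hive, the lag-based filter above can also be written as plain SQL and run through spark.sql (or, with essentially the same query, in Hive). A minimal sketch, assuming a table named table1 with columns col_a, col_b and a date/timestamp column dt:

val result = spark.sql("""
  SELECT col_a, col_b, dt, diff
  FROM (
    SELECT col_a, col_b, dt,
           -- gap in days to the next-later record in the same (col_a, col_b) group
           DATEDIFF(LAG(dt, 1) OVER (PARTITION BY col_a, col_b ORDER BY dt DESC), dt) AS diff
    FROM table1
  ) t
  WHERE diff > 182.5 OR diff IS NULL
""")
result.show()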
Related
I am looking for help joining two DataFrames with a conditional join on time columns, using Spark Scala.
DF1
+-------------------+--------+--------+
|time_1             |revision|review_1|
+-------------------+--------+--------+
|2022-04-05 08:32:00|1       |abc     |
|2022-04-05 10:15:00|2       |abc     |
|2022-04-05 12:15:00|3       |abc     |
|2022-04-05 09:00:00|1       |xyz     |
|2022-04-05 20:20:00|2       |xyz     |
+-------------------+--------+--------+
DF2:
+-------------------+--------+-------+
|time_2             |review_2|value  |
+-------------------+--------+-------+
|2022-04-05 08:30:00|abc     |value_1|
|2022-04-05 09:48:00|abc     |value_2|
|2022-04-05 15:40:00|abc     |value_3|
|2022-04-05 08:00:00|xyz     |value_4|
|2022-04-05 09:00:00|xyz     |value_5|
|2022-04-05 10:00:00|xyz     |value_6|
|2022-04-05 11:00:00|xyz     |value_7|
|2022-04-05 12:00:00|xyz     |value_8|
+-------------------+--------+-------+
Desired Output DF:
+-------------------+--------+--------+-------+
|time_1             |revision|review_1|value  |
+-------------------+--------+--------+-------+
|2022-04-05 08:32:00|1       |abc     |value_1|
|2022-04-05 10:15:00|2       |abc     |value_2|
|2022-04-05 12:15:00|3       |abc     |null   |
|2022-04-05 09:00:00|1       |xyz     |value_6|
|2022-04-05 20:20:00|2       |xyz     |null   |
+-------------------+--------+--------+-------+
As in row 4 of the final output (where time_1 = 2022-04-05 09:00:00), if multiple values match during the join then only the latest one in time should be taken.
Furthermore, if a row of df1 has no match in the join then its value column should be null.
We need to join the two DFs on two conditions:
review_1 === review_2 &&
time_1 within +1/-1 hour of time_2 (if multiple records match, show the latest value, as with value_6 above)
Here is the code necessary to join the DataFrames:
I have commented the code so as to explain the logic.
TL;DR
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}
val SECONDS_IN_ONE_HOUR = 60 * 60
val window = Window.partitionBy("time_1").orderBy(col("time_2").desc)
df1WithEpoch
.join(df2WithEpoch,
df1WithEpoch("review_1") === df2WithEpoch("review_2")
&& (
// if `time_1` is in between `time_2` and `time_2` - 1 hour
(df1WithEpoch("epoch_time_1") >= df2WithEpoch("epoch_time_2") - SECONDS_IN_ONE_HOUR)
// if `time_1` is in between `time_2` and `time_2` + 1 hour
&& (df1WithEpoch("epoch_time_1") <= df2WithEpoch("epoch_time_2") + SECONDS_IN_ONE_HOUR)
),
// LEFT OUTER is necessary to get back `2022-04-05 12:15:00` and `2022-04-05 20:20:00` which have no join to `df2` in the time window
"left_outer"
)
.withColumn("row_num", row_number().over(window))
.filter(col("row_num") === 1)
// select only the columns we care about
.select("time_1", "revision", "review_1", "value")
// order by to give the results in the same order as in the Question
.orderBy(col("review_1"), col("revision"))
.show(false)
Full breakdown
Let's start off with your DataFrames: df1 and df2 in code:
val df1 = List(
("2022-04-05 08:32:00", 1, "abc"),
("2022-04-05 10:15:00", 2, "abc"),
("2022-04-05 12:15:00", 3, "abc"),
("2022-04-05 09:00:00", 1, "xyz"),
("2022-04-05 20:20:00", 2, "xyz")
).toDF("time_1", "revision", "review_1")
df1.show(false)
gives:
+-------------------+--------+--------+
|time_1 |revision|review_1|
+-------------------+--------+--------+
|2022-04-05 08:32:00|1 |abc |
|2022-04-05 10:15:00|2 |abc |
|2022-04-05 12:15:00|3 |abc |
|2022-04-05 09:00:00|1 |xyz |
|2022-04-05 20:20:00|2 |xyz |
+-------------------+--------+--------+
val df2 = List(
("2022-04-05 08:30:00", "abc", "value_1"),
("2022-04-05 09:48:00", "abc", "value_2"),
("2022-04-05 15:40:00", "abc", "value_3"),
("2022-04-05 08:00:00", "xyz", "value_4"),
("2022-04-05 09:00:00", "xyz", "value_5"),
("2022-04-05 10:00:00", "xyz", "value_6"),
("2022-04-05 11:00:00", "xyz", "value_7"),
("2022-04-05 12:00:00", "xyz", "value_8")
).toDF("time_2", "review_2", "value")
df2.show(false)
gives:
+-------------------+--------+-------+
|time_2 |review_2|value |
+-------------------+--------+-------+
|2022-04-05 08:30:00|abc |value_1|
|2022-04-05 09:48:00|abc |value_2|
|2022-04-05 15:40:00|abc |value_3|
|2022-04-05 08:00:00|xyz |value_4|
|2022-04-05 09:00:00|xyz |value_5|
|2022-04-05 10:00:00|xyz |value_6|
|2022-04-05 11:00:00|xyz |value_7|
|2022-04-05 12:00:00|xyz |value_8|
+-------------------+--------+-------+
Next we need new columns on which we can do the date-range check, with time represented as a single number (epoch seconds) so that adding or subtracting an hour is simple arithmetic:
import org.apache.spark.sql.functions.{col, row_number, unix_timestamp}

// add a new column, temporarily, which contains the time in
// epoch format: with this, adding/subtracting an hour can easily be done.
val df1WithEpoch = df1.withColumn("epoch_time_1", unix_timestamp(col("time_1")))
val df2WithEpoch = df2.withColumn("epoch_time_2", unix_timestamp(col("time_2")))
df1WithEpoch.show()
df2WithEpoch.show()
gives:
+-------------------+--------+--------+------------+
| time_1|revision|review_1|epoch_time_1|
+-------------------+--------+--------+------------+
|2022-04-05 08:32:00| 1| abc| 1649147520|
|2022-04-05 10:15:00| 2| abc| 1649153700|
|2022-04-05 12:15:00| 3| abc| 1649160900|
|2022-04-05 09:00:00| 1| xyz| 1649149200|
|2022-04-05 20:20:00| 2| xyz| 1649190000|
+-------------------+--------+--------+------------+
+-------------------+--------+-------+------------+
| time_2|review_2| value|epoch_time_2|
+-------------------+--------+-------+------------+
|2022-04-05 08:30:00| abc|value_1| 1649147400|
|2022-04-05 09:48:00| abc|value_2| 1649152080|
|2022-04-05 15:40:00| abc|value_3| 1649173200|
|2022-04-05 08:00:00| xyz|value_4| 1649145600|
|2022-04-05 09:00:00| xyz|value_5| 1649149200|
|2022-04-05 10:00:00| xyz|value_6| 1649152800|
|2022-04-05 11:00:00| xyz|value_7| 1649156400|
|2022-04-05 12:00:00| xyz|value_8| 1649160000|
+-------------------+--------+-------+------------+
and finally to join:
import org.apache.spark.sql.expressions.Window
val SECONDS_IN_ONE_HOUR = 60 * 60
val window = Window.partitionBy("time_1").orderBy(col("time_2").desc)
df1WithEpoch
.join(df2WithEpoch,
df1WithEpoch("review_1") === df2WithEpoch("review_2")
&& (
// if `time_1` is in between `time_2` and `time_2` - 1 hour
(df1WithEpoch("epoch_time_1") >= df2WithEpoch("epoch_time_2") - SECONDS_IN_ONE_HOUR)
// if `time_1` is in between `time_2` and `time_2` + 1 hour
&& (df1WithEpoch("epoch_time_1") <= df2WithEpoch("epoch_time_2") + SECONDS_IN_ONE_HOUR)
),
// LEFT OUTER is necessary to get back `2022-04-05 12:15:00` and `2022-04-05 20:20:00` which have no join to `df2` in the time window
"left_outer"
)
.withColumn("row_num", row_number().over(window))
.filter(col("row_num") === 1)
// select only the columns we care about
.select("time_1", "revision", "review_1", "value")
// order by to give the results in the same order as in the Question
.orderBy(col("review_1"), col("revision"))
.show(false)
gives:
+-------------------+--------+--------+-------+
|time_1 |revision|review_1|value |
+-------------------+--------+--------+-------+
|2022-04-05 08:32:00|1 |abc |value_1|
|2022-04-05 10:15:00|2 |abc |value_2|
|2022-04-05 12:15:00|3 |abc |null |
|2022-04-05 09:00:00|1 |xyz |value_6|
|2022-04-05 20:20:00|2 |xyz |null |
+-------------------+--------+--------+-------+
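To see why value_6 is the one kept for the xyz row at 2022-04-05 09:00:00, it can help to look at the joined rows before the row_number filter. A small sketch reusing df1WithEpoch, df2WithEpoch and SECONDS_IN_ONE_HOUR from above (joined is just a name chosen here):

val joined = df1WithEpoch.join(df2WithEpoch,
  df1WithEpoch("review_1") === df2WithEpoch("review_2")
    && (df1WithEpoch("epoch_time_1") >= df2WithEpoch("epoch_time_2") - SECONDS_IN_ONE_HOUR)
    && (df1WithEpoch("epoch_time_1") <= df2WithEpoch("epoch_time_2") + SECONDS_IN_ONE_HOUR),
  "left_outer")

// the xyz row at 09:00:00 matches value_4 (08:00), value_5 (09:00) and value_6 (10:00);
// ordering by time_2 descending and keeping row_number = 1 retains the latest match, value_6
joined.filter(col("time_1") === "2022-04-05 09:00:00")
  .orderBy(col("time_2").desc)
  .show(false)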
I have a Spark dataframe df:
| id | year | month |
---------------------
| 1  | 2020 | 01    |
| 2  | 2019 | 03    |
| 3  | 2020 | 01    |
I have a sequence year_month = Seq((2019, 1), (2020, 1), (2021, 1)).
The year_month sequence gets generated dynamically every time the code runs.
I want to filter the dataframe df based on the year_month sequence, keeping rows where ($"year", $"month") matches one of the (year, month) pairs in year_month.
You can achieve this by:
1. Creating a dataframe from year_month
2. Performing an inner join of year_month with your original dataframe on year and month
3. Choosing distinct records
The resulting dataframe will contain the matched rows.
Working Example
Setup
import spark.implicits._
val dfData = Seq((1,2020,1),(2,2019,3),(3,2020,1))
val df = dfData.toDF()
.selectExpr("_1 as id"," _2 as year","_3 as month")
df.createOrReplaceTempView("original_data")
val year_month = Seq((2019,1),(2020,1),(2021,1))
Step 1
// Create Temporary DataFrame
val yearMonthDf = year_month.toDF()
.selectExpr("_1 as year","_2 as month" )
yearMonthDf.createOrReplaceTempView("temp_year_month")
Step 2
var dfResult = spark.sql("select o.id, o.year, o.month from original_data o inner join temp_year_month t on o.year = t.year and o.month = t.month")
Step 3
var dfResultDistinct = dfResult.distinct()
Output
dfResultDistinct.show()
+---+----+-----+
| id|year|month|
+---+----+-----+
| 1|2020| 1|
| 3|2020| 1|
+---+----+-----+
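For reference, the same inner join plus distinct can also be expressed with the DataFrame API instead of Spark SQL; a minimal sketch reusing the df and yearMonthDf defined above:

// joining on a Seq of column names keeps a single year/month pair of columns in the result
val dfResultApi = df
  .join(yearMonthDf, Seq("year", "month"), "inner")
  .select("id", "year", "month")
  .distinct()

dfResultApi.show()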
NB. If you are interested in finding the matching (year, month) combinations irrespective of the id, you could update the Spark SQL to the following (o.id has been removed):
select
o.year,
o.month
from
original_data o
inner join
temp_year_month t on o.year = t.year and
o.month = t.month
which would give the following result:
+----+-----+
|year|month|
+----+-----+
|2020| 1|
+----+-----+
I'm new to Scala. Say I have a dataset:
>>> ds.show()
+--------------+-----------------+-------------+
|year |nb_product_sold | system_year |
+--------------+-----------------+-------------+
|2010 | 1 | 2012 |
|2012 | 2 | 2012 |
|2012 | 4 | 2012 |
|2015 | 3 | 2012 |
|2019 | 4 | 2012 |
|2021 | 5 | 2012 |
+--------------+-----------------+-------------+
and I have a list years = List(1, 3, 8), which means x years after the system_year.
The goal is to calculate the total number of products sold for each of those years after system_year.
In other words, I have to calculate the total products sold from system_year (2012) up to 2013, 2015 and 2020 respectively.
The output dataset should be like this :
+-------+-----------------------+
| year | total_product_sold |
+-------+-----------------------+
| 1 | 6 | -> 2012 - 2013 6 products sold
| 3 | 9 | -> 2012 - 2015 9 products sold
| 8 | 13 | -> 2012 - 2020 13 products sold
+-------+-----------------------+
I want to know how to do this in Scala. Should I use groupBy() in this case?
You could have used a groupBy with case/when if the year ranges didn't overlap, but here you'll need to do a groupBy for each year and then union the 3 grouped DataFrames:
import org.apache.spark.sql.functions.{lit, sum}

val years = List(1, 3, 8)
val result = years.map{ y =>
df.filter($"year".between($"system_year", $"system_year" + y))
.groupBy(lit(y).as("year"))
.agg(sum($"nb_product_sold").as("total_product_sold"))
}.reduce(_ union _)
result.show
//+----+------------------+
//|year|total_product_sold|
//+----+------------------+
//| 1| 6|
//| 3| 9|
//| 8| 13|
//+----+------------------+
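As a side note, because sum ignores nulls, the overlapping ranges can also be handled in one pass with conditional sums instead of one groupBy per offset. A rough sketch (producing one column per offset rather than one row per offset), reusing the df and years from above:

import org.apache.spark.sql.functions.{sum, when}

// one conditional sum per offset; rows outside the range contribute null and are ignored by sum
val totals = years.map(y =>
  sum(when($"year".between($"system_year", $"system_year" + y), $"nb_product_sold")).as(s"after_$y"))

df.agg(totals.head, totals.tail: _*).show()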
There might be multiple, more efficient ways of doing this than what I am showing you, but it works for your use case.
//Sample Data
val df = Seq((2010,1,2012),(2012,2,2012),(2012,4,2012),(2015,3,2012),(2019,4,2012),(2021,5,2012)).toDF("year","nb_product_sold","system_year")
//taking the difference of the years from system year
val df1 = df.withColumn("Difference",$"year" - $"system_year")
//getting the total sold per year by partitioning on year (every row in a year gets the same sum)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val w = Window.partitionBy("year").orderBy("year")
val df2 = df1.withColumn("runningsum", sum("nb_product_sold").over(w)).withColumn("yearlist",lit(0)).dropDuplicates("year","system_year","Difference")
//creating Years list
val years = List(1, 3, 8)
//for each offset in years, filtering the matching year's total and unioning everything into one dataframe
var df3= spark.createDataFrame(sc.emptyRDD[Row], df2.schema)
for (year <- years){
val innerdf = df2.filter($"Difference" >= year -1 && $"Difference" <= year).withColumn("yearlist",lit(year))
df3 = df3.union(innerdf)
}
//partitioning by system_year and taking a running sum of the per-year totals to get the cumulative total for each offset
val w1 = Window.partitionBy("system_year").orderBy("year")
val finaldf = df3.withColumn("total_product_sold", sum("runningsum").over(w1)).select("yearlist","total_product_sold")
You can see that the output of finaldf matches the expected result: total_product_sold of 6, 9 and 13 for years 1, 3 and 8 respectively.
While trying to get the year and week number of a range of dates spanning multiple years, I am running into some issues at the start/end of the year.
I understand the logic for the week number and for the year when they run separately. However, when they are combined, in some cases they don't give consistent results, and I was wondering what is the best way in Spark to make sure those scenarios are handled with a consistent year for the given week number.
For example, running:
spark.sql("select year('2017-01-01') as year, weekofyear('2017-01-01') as weeknumber").show(false)
outputs:
+----+----------+
|year|weeknumber|
+----+----------+
|2017|52 |
+----+----------+
But the desired output would be:
+----+----------+
|year|weeknumber|
+----+----------+
|2016|52 |
+----+----------+
and running:
spark.sql("select year('2018-12-31') as year, weekofyear('2018-12-31') as weeknumber").show(false)
produces:
+----+----------+
|year|weeknumber|
+----+----------+
|2018|1 |
+----+----------+
But what is expected is:
+----+----------+
|year|weeknumber|
+----+----------+
|2019|1 |
+----+----------+
Code is running on Spark 2.4.2.
This Spark behavior is consistent with the ISO 8601 definition; you cannot change it. However, there is a workaround I could think of.
You can first determine the day of the week: if it is less than 4, increase the year by one; if it equals 4, keep the year untouched; otherwise decrease the year by one.
Example with 2017-01-01
sql("select case when date_format('2017-01-01', 'u') < 4 then year('2017-01-01')+1 when date_format('2017-01-01', 'u') = 4 then year('2017-01-01') else year('2017-01-01')- 1 end as year, weekofyear('2017-01-01') as weeknumber, date_format('2017-01-01', 'u') as dayOfWeek").show(false)
+----+----------+---------+
|year|weeknumber|dayOfWeek|
+----+----------+---------+
|2016|52 |7 |
+----+----------+---------+
Example with 2018-12-31
sql("select case when date_format('2018-12-31', 'u') < 4 then year('2018-12-31')+1 when date_format('2018-12-31', 'u') = 4 then year('2018-12-31') else year('2018-12-31')- 1 end as year, weekofyear('2018-12-31') as weeknumber, date_format('2018-12-31', 'u') as dayOfWeek").show(false)
+----+----------+---------+
|year|weeknumber|dayOfWeek|
+----+----------+---------+
|2019|1 |1 |
+----+----------+---------+
val df = Seq(("2017-01-01"), ("2018-12-31")).toDF("dateval")
+----------+
| dateval|
+----------+
|2017-01-01|
|2018-12-31|
+----------+
df.createOrReplaceTempView("date_tab")
val newDF = spark.sql("""select dateval,
case when weekofyear(dateval)=1 and month(dateval)=12 then struct((year(dateval)+1) as yr, weekofyear(dateval) as wk)
when weekofyear(dateval)=52 and month(dateval)=1 then struct((year(dateval)-1) as yr, weekofyear(dateval) as wk)
else struct((year(dateval)) as yr, weekofyear(dateval) as wk) end as week_struct
from date_tab""");
newDF.select($"dateval", $"week_struct.yr", $"week_struct.wk").show()
+----------+----+---+
| dateval| yr| wk|
+----------+----+---+
|2017-01-01|2016| 52|
|2018-12-31|2019| 1|
+----------+----+---+
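The same case logic can also be written with the DataFrame API. A rough sketch, with the week-52 check loosened to >= 52 so that ISO week 53 (for example 2016-01-01) is handled as well:

import org.apache.spark.sql.functions._

val withWeekYear = df
  .withColumn("wk", weekofyear($"dateval"))
  .withColumn("yr",
    when($"wk" === 1 && month($"dateval") === 12, year($"dateval") + 1)   // first ISO week spilling into December
      .when($"wk" >= 52 && month($"dateval") === 1, year($"dateval") - 1) // last ISO week (52 or 53) spilling into January
      .otherwise(year($"dateval")))

withWeekYear.select($"dateval", $"yr", $"wk").show()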
You can also use a UDF to achieve this
import org.apache.spark.sql.types._
import java.time.temporal.IsoFields
def weekYear(date: java.sql.Date) : Option[Int] = {
if(date == null) None
else Some(date.toLocalDate.get(IsoFields.WEEK_BASED_YEAR))
}
Register this udf as
spark.udf.register("yearOfWeek", weekYear _)
Result:
scala> spark.sql("select yearOfWeek('2017-01-01') as year, WEEKOFYEAR('2017-01-01') as weeknumber").show(false)
+----+----------+
|year|weeknumber|
+----+----------+
|2016|52 |
+----+----------+
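The registered UDF can also be applied to a DataFrame column rather than a literal, for example reusing the df with the dateval column from the previous answer:

df.selectExpr(
    "dateval",
    "yearOfWeek(to_date(dateval)) as year",   // cast the string to a date before passing it to the UDF
    "weekofyear(dateval) as weeknumber")
  .show(false)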
I have a Spark DataFrame of customers as shown below.
#SparkR code
customers <- data.frame(custID = c("001", "001", "001", "002", "002", "002", "002"),
date = c("2017-02-01", "2017-03-01", "2017-04-01", "2017-01-01", "2017-02-01", "2017-03-01", "2017-04-01"),
value = c('new', 'good', 'good', 'new', 'good', 'new', 'bad'))
customers <- createDataFrame(customers)
display(customers)
custID| date | value
--------------------------
001 | 2017-02-01| new
001 | 2017-03-01| good
001 | 2017-04-01| good
002 | 2017-01-01| new
002 | 2017-02-01| good
002 | 2017-03-01| new
002 | 2017-04-01| bad
In the first observed month for a custID the customer gets a value of 'new'. Thereafter they are classified as 'good' or 'bad'. However, it is possible for a customer to revert from 'good' or 'bad' back to 'new' if they open a second account. When this happens I want to tag the customer with '2' instead of '1', to indicate that they opened a second account, as shown below. How can I do this in Spark? Either SparkR or PySpark commands work.
#What I want to get
custID| date | value | tag
--------------------------------
001 | 2017-02-01| new | 1
001 | 2017-03-01| good | 1
001 | 2017-04-01| good | 1
002 | 2017-01-01| new | 1
002 | 2017-02-01| good | 1
002 | 2017-03-01| new | 2
002 | 2017-04-01| bad | 2
In pyspark:
from pyspark.sql import SparkSession, functions as f
from pyspark.sql.window import Window
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
# df is equal to your customers dataframe
df = spark.read.load('file:///home/zht/PycharmProjects/test/text_file.txt', format='csv', header=True, sep='|').cache()
df_new = df.filter(df['value'] == 'new').withColumn('tag', f.rank().over(Window.partitionBy('custID').orderBy('date')))
df = df_new.union(df.filter(df['value'] != 'new').withColumn('tag', f.lit(None)))
df = df.withColumn('tag', f.collect_list('tag').over(Window.partitionBy('custID').orderBy('date'))) \
.withColumn('tag', f.UserDefinedFunction(lambda x: x.pop(), IntegerType())('tag'))
df.show()
And output:
+------+----------+-----+---+
|custID| date|value|tag|
+------+----------+-----+---+
| 001|2017-02-01| new| 1|
| 001|2017-03-01| good| 1|
| 001|2017-04-01| good| 1|
| 002|2017-01-01| new| 1|
| 002|2017-02-01| good| 1|
| 002|2017-03-01| new| 2|
| 002|2017-04-01| bad| 2|
+------+----------+-----+---+
By the way, pandas can do this easily.
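For completeness, the same tagging can be done in Scala with a running conditional sum over a window, counting how many 'new' rows have been seen so far per custID. A minimal sketch, assuming a Scala DataFrame customers with the custID, date and value columns from the question:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// running count of 'new' rows per customer, ordered by date:
// 1 from the first account onwards, 2 from the second 'new' onwards, and so on
val w = Window.partitionBy("custID").orderBy("date")
              .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val tagged = customers.withColumn("tag",
  sum(when($"value" === "new", 1).otherwise(0)).over(w))

tagged.show()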
This can be done using the following piece of code:
# df is the customers DataFrame from the question; register it as a temp view so it can be queried with sql()
createOrReplaceTempView(df, "df")

# filter out all the records with "new" and number them per custID by date
df_new <- sql("select * from df where value = 'new'")
createOrReplaceTempView(df_new, "df_new")
df_new <- sql("select *, row_number() over(partition by custID order by date) as tag from df_new")
createOrReplaceTempView(df_new, "df_new")

# join each record to the 'new' records at or before its date and keep the tag of the latest one
df <- sql("select custID, date, value, max(tag) as tag from
  (select t1.*, t2.tag from df t1 left outer join df_new t2 on
   t1.custID = t2.custID and t1.date >= t2.date) t group by 1,2,3")