I have two tables that I'm trying to join on date and account ID. In table one the date field is continuous (daily) and stored as a string, whereas in table two a portion of the date field contains end-of-month dates only.
Edit: The sample below is representative of the tables I am working with. Originally I only mentioned joining the ID and Date fields, but there are other fields in both tables that I want to keep in the final output. At full scale, Table 1 has thousands of IDs with records recorded daily across multiple years. Table 2 is similar, except that at some point its date field switched from end-of-month dates only to the same daily dates as Table 1. The tables below have been updated.
A sample of the data set looks as follows:
Table1:
| ID | Date         |
|----|--------------|
| 1  | "2022-01-30" |
| 1  | "2022-01-31" |
| 1  | "2022-02-01" |
| 1  | "2022-02-02" |
Table2:
| ID | Date         | Field_flag |
|----|--------------|------------|
| 1  | "2021-12-31" | a          |
| 1  | "2022-01-31" | a          |
| 1  | "2022-02-01" | a          |
| 1  | "2022-02-02" | b          |
| 1  | "2022-02-03" | b          |
Desired result:
| table1.ID | table1.Date  | table2.Date  | table2.Field_flag |
|-----------|--------------|--------------|-------------------|
| 1         | "2022-01-30" | "2022-01-31" | a                 |
| 1         | "2022-01-31" | "2022-01-31" | a                 |
| 1         | "2022-02-01" | "2022-02-01" | a                 |
| 1         | "2022-02-02" | "2022-02-02" | b                 |
Are there any suggestions on how to approach this type of result?
As a temporary workaround I'm splitting the date fields into month and year helper columns and joining on those, but I would like something like the inner join below to work.
SELECT table1.*
      ,table2.Date AS date_table2
      ,table2.Field_flag
FROM table1
INNER JOIN (SELECT * FROM Table2) table2
    ON table1.ID = table2.ID
    AND (table1.Date = table2.Date OR table1.Date < table2.Date)
One way to handle this is with a correlated subquery that, for each date in the first table, finds the same date or the closest later date in the second table.
SELECT
    t1.ID,
    t1.Date AS Date_t1,
    (SELECT t2.Date FROM Table2 t2
     WHERE t2.ID = t1.ID AND t2.Date >= t1.Date
     ORDER BY t2.Date LIMIT 1) AS Date_t2
FROM Table1 t1
ORDER BY t1.ID, t1.Date;
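To also carry Field_flag through to the output, as in the desired result above, the same lookup can be repeated for that column. This is only a minimal sketch of that idea, assuming the same Table1/Table2 names and an engine that supports LIMIT; engines with LATERAL joins or CROSS APPLY could fetch both columns from a single lookup instead.
SELECT
    t1.ID,
    t1.Date AS Date_t1,
    (SELECT t2.Date FROM Table2 t2
     WHERE t2.ID = t1.ID AND t2.Date >= t1.Date
     ORDER BY t2.Date LIMIT 1) AS Date_t2,
    -- same lookup again, returning the flag of the matched row
    (SELECT t2.Field_flag FROM Table2 t2
     WHERE t2.ID = t1.ID AND t2.Date >= t1.Date
     ORDER BY t2.Date LIMIT 1) AS Field_flag
FROM Table1 t1
ORDER BY t1.ID, t1.Date;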
This might not be the optimal way, but throwing it into the mix: we can join the second table twice, once for the daily dates and once for the month-end dates.
# flag each row in table 2 by the gap (in days) to its neighbors within the same id
data2_2_sdf = spark.sql('''
    select *,
           datediff(dt, lag(dt) over (partition by id order by dt)) as dt_diff_lag,
           datediff(lead(dt) over (partition by id order by dt), dt) as dt_diff_lead
    from data2
''')
data2_2_sdf.createOrReplaceTempView('data2_2')
# +---+----------+----+-----------+------------+
# | id| dt|flag|dt_diff_lag|dt_diff_lead|
# +---+----------+----+-----------+------------+
# | 1|2021-12-31| a| null| 31|
# | 1|2022-01-31| a| 31| 1|
# | 1|2022-02-01| a| 1| 1|
# | 1|2022-02-02| b| 1| 1|
# | 1|2022-02-03| b| 1| null|
# +---+----------+----+-----------+------------+
# month-end rows (gap >= 28 days on either side) join on year+month,
# daily rows (gap of 1 day) join on the exact date; coalesce the two matches
spark.sql('''
    select a.*, coalesce(b.dt, c.dt) as dt2, coalesce(b.flag, c.flag) as flag
    from data1 a
    left join (select * from data2_2 where dt_diff_lag >= 28 or dt_diff_lead >= 28) b
        on a.id = b.id and year(a.dt)*100 + month(a.dt) = year(b.dt)*100 + month(b.dt)
    left join (select * from data2_2 where dt_diff_lag = 1 and coalesce(dt_diff_lead, 1) = 1) c
        on a.id = c.id and a.dt = c.dt
''').show()
# +---+----------+----------+----+
# | id| dt| dt2|flag|
# +---+----------+----------+----+
# | 1|2022-02-01|2022-02-01| a|
# | 1|2022-01-31|2022-01-31| a|
# | 1|2022-01-30|2022-01-31| a|
# | 1|2022-02-02|2022-02-02| b|
# +---+----------+----------+----+
I am rewriting legacy SAS code in PySpark. In one of those blocks, the SAS code uses the lag function. The way I understand the notes, an ID is a duplicate if it has two intake dates that are less than 4 days apart.
/* Next create flag if the same ID has two intake dates less than 4 days apart */
/* Data MUST be sorted by ID and DESCENDING IntakeDate!!! */
data duplicates (drop= lag_ID lag_IntakeDate);
    set df2;
    by ID;
    lag_ID = lag(ID);
    lag_IntakeDate = lag(IntakeDate);
    if ID = lag_ID then do;
        intake2TIME = intck('day', lag_IntakeDate, IntakeDate);
    end;
    if 0 <= abs(intake2TIME) < 4 then DUPLICATE = 1;
run;
/* If DUPLICATE = 1, then it is a duplicate and eventually will be dropped. */
I tried to meet the condition as described in the comments: using SQL I pulled the ID and intake dates, ordered by ID and descending intake date:
SELECT ID, intakeDate, col3, col4
from df order by ID, intakeDate DESC
I googled for the lag equivalent and this is what I found:
https://www.educba.com/pyspark-lag/
However, I have not used window functions before, and the concept introduced by the site does not quite make sense to me yet, so I tried the following to check whether my understanding of WHERE EXISTS might work:
SELECT *
FROM df
WHERE EXISTS (
    SELECT *
    FROM df v2
    WHERE df.ID = v2.ID AND DATEDIFF(df.IntakeDate, v2.IntakeDate) > 4
)
/* not sure about the second condition, though */
Initial df
+-----------+------------------+
| Id| IntakeDate|
+-----------+------------------+
| 5.0| 2021-04-14|
| 5.0| 2021-05-06|
| 5.0| 2021-05-08|
| 10.0| 2021-04-21|
| 10.0| 2021-05-25|
| 14.0| 2021-03-08|
| 14.0| 2021-03-09|
| 14.0| 2021-09-30|
| 14.0| 2022-04-08|
| 15.0| 2021-04-27|
| 15.0| 2021-05-18|
| 15.0| 2022-01-17|
| 26.0| 2021-08-27|
| 26.0| 2021-09-17|
+-----------+------------------+
The expected df will have a row dropped if the next intake date is less than 3 days after the prior date:
+-----------+------------------+
| Id| IntakeDate|
+-----------+------------------+
| 5.0| 2021-04-14|
| 5.0| 2021-05-06| row to drop
| 5.0| 2021-05-08|
| 10.0| 2021-04-21|
| 10.0| 2021-05-25|
| 14.0| 2021-03-08| row to drop
| 14.0| 2021-03-09|
| 14.0| 2021-09-30|
| 14.0| 2022-04-08|
| 15.0| 2021-04-27|
| 15.0| 2021-05-18|
| 15.0| 2022-01-17|
| 26.0| 2021-08-27|
| 26.0| 2021-09-17|
+-----------+------------------+
Please try the following code:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# for each row, find the next intake date within the same id
lead_over_id = Window.partitionBy('id').orderBy('IntakeDate')

df = (df
      .withColumn('lead_1_date', F.lead('IntakeDate', 1).over(lead_over_id))
      .withColumn('date_diff', F.datediff('lead_1_date', 'IntakeDate'))
      # keep a row only if the next intake is at least 4 days later (or there is no next intake)
      .where((F.col('date_diff') >= 4) | F.col('date_diff').isnull())
      .drop('lead_1_date', 'date_diff')
      )
I want to pivot a column and then rank the data from the pivoted column. Here is sample data:
| id | objective | metric | score |
|----|-----------|-------------|-------|
| 1 | Sales | Total Sales | 10 |
| 1 | Marketing | Total Reach | 4 |
| 2 | Sales | Total Sales | 2 |
| 2 | Marketing | Total Reach | 11 |
| 3 | Sales | Total Sales | 9 |
This would be my expected output after pivot + rank:
| id | Sales | Marketing |
|----|--------|-----------|
| 1 | 1 | 2 |
| 2 | 3 | 1 |
| 3 | 2 | 3 |
The ranking is based on sum(score) for each objective. An objective can also have multiple metrics, but that isn't included in the sample for simplicity.
I have been able to successfully pivot and sum the scores like so:
pivot = (
    spark.table('scoring_table')
    .select('id', 'objective', 'metric', 'score')
    .groupBy('id')
    .pivot('objective')
    .agg(
        sf.sum('score').alias('score')
    )
)
This then lets me see the total score per objective, but I'm unsure how to rank these. I have tried the following after aggregation:
.withColumn('rank', rank().over(Window.partitionBy('id', 'objective').orderBy(sf.col('score').desc())))
However, objective is no longer available at this point, as it has been pivoted away. I then tried this instead:
.withColumn('rank', rank().over(Window.partitionBy('id', 'Sales', 'Marketing').orderBy(sf.col('score').desc())))
But the score column is also no longer available. How can I rank these scores after pivoting the data?
You just need to order by the score after pivot:
from pyspark.sql import functions as F, Window

df2 = df.groupBy('id').pivot('objective').agg(F.sum('score')).fillna(0)
df3 = df2.select(
    'id',
    *[F.rank().over(Window.orderBy(F.desc(c))).alias(c) for c in df2.columns[1:]]
)
df3.show()
+---+---------+-----+
| id|Marketing|Sales|
+---+---------+-----+
| 2| 1| 3|
| 1| 2| 1|
| 3| 3| 2|
+---+---------+-----+
I have a table called 'daily_budgets':
+----+------------+----------+--------------+
| id | start_date | end_date | daily_budget |
+----+------------+----------+--------------+
| 1 | 25/04/18 | 29/04/18 | 500 |
+----+------------+----------+--------------+
| 2 | 26/04/18 | 27/04/18 | 1000 |
+----+------------+----------+--------------+
which shows the daily budget that an item has within a designated timeframe (start_date to end_date).
Then I have another one called 'year_2018' where I have generated a time_series with all the dates for 2018:
+----------+
| date |
+----------+
| 01/01/18 |
+----------+
| 02/01/18 |
+----------+
| 03/01/18 |
+----------+
| 04/01/18 |
+----------+
| 05/01/18 |
+----------+
etc.
Now I just want to join those two tables so that I get the total daily_budget grouped by date. The first date in the resulting table should be the minimum start_date in the 'daily_budgets' table.
+----------+--------------+
| date | daily_budget |
+----------+--------------+
| 25/04/18 | 500 |
+----------+--------------+
| 26/04/18 | 1500 |
+----------+--------------+
| 27/04/18 | 1500 |
+----------+--------------+
| 28/04/18 | 500 |
+----------+--------------+
| 29/04/18 | 500 |
+----------+--------------+
Thanks very much for the help!
I am using: PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.2058
Join the two tables on the condition that the date from your calendar table falls between (inclusive) the budget start and end dates. Then aggregate by date to generate the totals.
SELECT
    c.date,
    SUM(daily_budget) AS daily_budget
FROM year_2018 c
INNER JOIN daily_budgets db
    ON c.date BETWEEN db.start_date AND db.end_date
GROUP BY
    c.date;
Note that I use an inner join here, which has the effect of filtering out any dates from the calendar table that do not fall within at least one record in the budget table. This ensures that the earliest date reported is also the earliest start_date in the budget table.
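If you instead wanted every calendar date to appear, with 0 on days that have no active budget, a LEFT JOIN plus COALESCE would do it. A sketch under the same table and column names:
SELECT
    c.date,
    COALESCE(SUM(db.daily_budget), 0) AS daily_budget  -- 0 on days with no active budget
FROM year_2018 c
LEFT JOIN daily_budgets db
    ON c.date BETWEEN db.start_date AND db.end_date
GROUP BY
    c.date
ORDER BY
    c.date;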
I have a 'prices' table with hundreds of millions of records and only four columns: uid, price, unit, dt. dt is a datetime in a standard format like '2017-05-01 00:00:00.585'.
I can quite easily select a period using
SELECT uid, price, unit from prices
WHERE dt > '2017-05-01 00:00:00.000'
AND dt < '2017-05-01 02:59:59.999'
What I can't work out is how to select the price of the last record in each second. (I also need the very first one of each second, but I guess that will be a similar, separate query.) There are some similar examples, but they did not work for me; when I tried to adapt them to my needs they just produced errors.
Could someone please help me crack this nut?
Let's say there is a table that has been generated with the help of this command:
CREATE TABLE test AS
SELECT timestamp '2017-09-16 20:00:00' + x * interval '0.1' second As my_timestamp
from generate_series(0,100) x
This table contains an increasing series of timestamps; each timestamp differs from its neighbors by 100 milliseconds (0.1 second), so there are 10 records within each second.
| my_timestamp |
|------------------------|
| 2017-09-16T20:00:00Z |
| 2017-09-16T20:00:00.1Z |
| 2017-09-16T20:00:00.2Z |
| 2017-09-16T20:00:00.3Z |
| 2017-09-16T20:00:00.4Z |
| 2017-09-16T20:00:00.5Z |
| 2017-09-16T20:00:00.6Z |
| 2017-09-16T20:00:00.7Z |
| 2017-09-16T20:00:00.8Z |
| 2017-09-16T20:00:00.9Z |
| 2017-09-16T20:00:01Z |
| 2017-09-16T20:00:01.1Z |
| 2017-09-16T20:00:01.2Z |
| 2017-09-16T20:00:01.3Z |
.......
The below query determines and prints the first and the last timestamp within each second:
SELECT my_timestamp,
       CASE
           WHEN rn1 = 1 THEN 'First'
           WHEN rn2 = 1 THEN 'Last'
           ELSE 'Somewhere in the middle'
       END AS which_row_within_a_second
FROM (
    SELECT *,
           row_number() OVER (PARTITION BY date_trunc('second', my_timestamp)
                              ORDER BY my_timestamp) AS rn1,
           row_number() OVER (PARTITION BY date_trunc('second', my_timestamp)
                              ORDER BY my_timestamp DESC) AS rn2
    FROM test
) xx
WHERE 1 IN (rn1, rn2)
ORDER BY my_timestamp;
| my_timestamp | which_row_within_a_second |
|------------------------|---------------------------|
| 2017-09-16T20:00:00Z | First |
| 2017-09-16T20:00:00.9Z | Last |
| 2017-09-16T20:00:01Z | First |
| 2017-09-16T20:00:01.9Z | Last |
| 2017-09-16T20:00:02Z | First |
| 2017-09-16T20:00:02.9Z | Last |
| 2017-09-16T20:00:03Z | First |
| 2017-09-16T20:00:03.9Z | Last |
| 2017-09-16T20:00:04Z | First |
| 2017-09-16T20:00:04.9Z | Last |
| 2017-09-16T20:00:05Z | First |
| 2017-09-16T20:00:05.9Z | Last |
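The same pattern can be applied back to the prices table from the question (uid, price, unit, dt). This is only a sketch assuming those column names; add uid to the PARTITION BY if you need the last price per uid rather than per second overall, and flip DESC to ASC (or reuse the rn1 column above) for the first record of each second:
SELECT uid, price, unit, dt
FROM (
    SELECT p.*,
           row_number() OVER (PARTITION BY date_trunc('second', dt)
                              ORDER BY dt DESC) AS rn   -- rn = 1 is the last row of its second
    FROM prices p
    WHERE dt >= '2017-05-01 00:00:00.000'
      AND dt <  '2017-05-01 03:00:00.000'               -- optional time window, as in the question
) x
WHERE rn = 1
ORDER BY dt;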
I have a problem with duplicates in my database, but I can't just do a distinct select into another table because I have unique data in some columns. I want to keep the last rating.
Example:
| ID | ProductName | Code | Rating |
|----|-------------|------|--------|
| 1  | Bag         | 1122 | 5      |
| 2  | Car         | 1133 | 2      |
| 3  | Bag         | 1122 | 3      |
| 4  | Car         | 1133 | 1      |
| 5  | Train       | 1144 | 1      |
As the result of the query I want to get:
| ID | ProductName | Code | Rating |
|----|-------------|------|--------|
| 3  | Bag         | 1122 | 3      |
| 4  | Car         | 1133 | 1      |
| 5  | Train       | 1144 | 1      |
One option uses a GROUP BY to identify the ID of the most recent row for each Code (and hence ProductName) group, then joins back to pick up the remaining columns:
SELECT t1.*
FROM yourTable t1
INNER JOIN
(
    SELECT Code, MAX(ID) AS ID
    FROM yourTable
    GROUP BY Code
) t2
    ON t1.Code = t2.Code
   AND t1.ID = t2.ID
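If window functions are available in your database, a ROW_NUMBER variant is a common alternative. A sketch assuming, as in the query above, that a higher ID means a more recent rating:
SELECT ID, ProductName, Code, Rating
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY Code ORDER BY ID DESC) AS rn  -- 1 = latest row per Code
    FROM yourTable t
) x
WHERE rn = 1
ORDER BY ID;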