With 2 PySpark dataframes:
id  myTimeStamp
1   2022-06-01 05:00
1   2022-06-06 05:00
2   2022-06-01 05:00
2   2022-06-02 05:00
2   2022-06-03 05:00
2   2022-06-04 08:00
3   2022-06-02 05:00
3   2022-06-04 10:00
myTimeToRemove
2022-06-01 05:00
2022-06-04 05:00
I need to remove the records from the first dataframe whose date appears in the second dataframe (the time of day doesn't matter).
Expected dataframe:
id  myTimeStamp
1   2022-06-06 05:00
2   2022-06-02 05:00
2   2022-06-03 05:00
3   2022-06-02 05:00
I tried

fdcn_df = fdcn_df.join(
    holidays_df,
    fdcn_df['myTimeStamp'].cast('date') != holidays_df['myTimeToRemove'].cast('date'),
    "inner"
)

but got no result.
I was expecting the expected dataframe shown above.
What is the data type of myTimeStamp and myTimeToRemove? Assuming they are timestamp columns, you can use a leftanti join:
from pyspark.sql import functions as func

fdcn_df.alias('a').join(
    holidays_df.alias('b'),
    on=func.to_date(func.col('a.myTimeStamp')) == func.to_date(func.col('b.myTimeToRemove')),
    how='leftanti'
).show(100, False)
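For reference, a minimal reproducible sketch of the above (my own illustration; the dataframe contents are taken from the question):

from pyspark.sql import SparkSession, functions as func

spark = SparkSession.builder.getOrCreate()

fdcn_df = spark.createDataFrame(
    [(1, '2022-06-01 05:00'), (1, '2022-06-06 05:00'),
     (2, '2022-06-01 05:00'), (2, '2022-06-02 05:00'),
     (2, '2022-06-03 05:00'), (2, '2022-06-04 08:00'),
     (3, '2022-06-02 05:00'), (3, '2022-06-04 10:00')],
    ['id', 'myTimeStamp']
).withColumn('myTimeStamp', func.to_timestamp('myTimeStamp', 'yyyy-MM-dd HH:mm'))

holidays_df = spark.createDataFrame(
    [('2022-06-01 05:00',), ('2022-06-04 05:00',)],
    ['myTimeToRemove']
).withColumn('myTimeToRemove', func.to_timestamp('myTimeToRemove', 'yyyy-MM-dd HH:mm'))

# leftanti keeps only the rows of fdcn_df whose calendar date has no match in holidays_df
fdcn_df.alias('a').join(
    holidays_df.alias('b'),
    on=func.to_date(func.col('a.myTimeStamp')) == func.to_date(func.col('b.myTimeToRemove')),
    how='leftanti'
).show(100, False)

This keeps only the rows 1/2022-06-06, 2/2022-06-02, 2/2022-06-03 and 3/2022-06-02, matching the expected dataframe.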
Given a detailed calendar:

Sunday
  hrs_per_day 0
Monday
  07:00:00 - 11:59:00  (5 hours)
  13:00:00 - 15:59:00  (3 hours)
  hrs_per_day 8
Tuesday
  07:00:00 - 11:59:00  (5 hours)
  13:00:00 - 15:59:00  (3 hours)
  hrs_per_day 8
Wednesday
  07:00:00 - 11:59:00  (5 hours)
  13:00:00 - 15:59:00  (3 hours)
  hrs_per_day 8
Thursday
  07:00:00 - 11:59:00  (5 hours)
  13:00:00 - 15:59:00  (3 hours)
  hrs_per_day 8
Friday
  07:00:00 - 12:59:00  (6 hours)
  hrs_per_day 6
Saturday
  hrs_per_day 0

hrs_per_week 38

How can I compute the start and end dates of a task based on its duration?
Suppose I have a task that can start after Sunday 8 AM and that will take 23 (8+8+7) hours of work.
Then the start date should be Monday 07:00:00 and the end date should be Wednesday 15:00:00.
I can work out the dates manually, but I'm not sure how to implement it in a program.
function get_start_end_dates(can_start_after, duration_hrs, calendar_data) {
    // ??????????
    return {start_date, end_date}
}
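For what it's worth, here is a minimal Python sketch of one possible approach (my own illustration, not a tested implementation). It assumes the calendar is a dict mapping weekday to working windows, and treats an end time like 11:59:00 as "work until 12:00:00" so the hour counts match the calendar above:

from datetime import datetime, time, timedelta

# weekday() numbering: Monday = 0 ... Sunday = 6
WORK_WINDOWS = {
    0: [(time(7, 0), time(12, 0)), (time(13, 0), time(16, 0))],  # Monday: 5 + 3 hours
    1: [(time(7, 0), time(12, 0)), (time(13, 0), time(16, 0))],  # Tuesday
    2: [(time(7, 0), time(12, 0)), (time(13, 0), time(16, 0))],  # Wednesday
    3: [(time(7, 0), time(12, 0)), (time(13, 0), time(16, 0))],  # Thursday
    4: [(time(7, 0), time(13, 0))],                              # Friday: 6 hours
    5: [],                                                       # Saturday
    6: [],                                                       # Sunday
}

def get_start_end_dates(can_start_after, duration_hrs, calendar=WORK_WINDOWS):
    remaining = timedelta(hours=duration_hrs)
    start_date = None
    cursor = can_start_after
    while remaining > timedelta(0):
        for win_start, win_end in calendar[cursor.weekday()]:
            w_start = max(cursor, cursor.replace(hour=win_start.hour, minute=win_start.minute,
                                                 second=0, microsecond=0))
            w_end = cursor.replace(hour=win_end.hour, minute=win_end.minute,
                                   second=0, microsecond=0)
            if w_start >= w_end:
                continue  # this window is already over when the task may start
            if start_date is None:
                start_date = w_start  # first moment work can actually begin
            available = w_end - w_start
            if available >= remaining:
                return start_date, w_start + remaining
            remaining -= available
        # nothing (or not enough) left today: move to midnight of the next day
        cursor = (cursor + timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0)
    return start_date, cursor

# Example from the question: can start after Sunday 8 AM (2022-06-05 is a Sunday), 23 hours of work
start, end = get_start_end_dates(datetime(2022, 6, 5, 8, 0), 23)
print(start, end)  # 2022-06-06 07:00:00 (Monday) -> 2022-06-08 15:00:00 (Wednesday)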
I'm trying to make a work schedule table.
I have a table like:
shift_starts_dt      shift_type
2022-01-01 08:00:00  Day
2022-01-01 20:00:00  Night
2022-01-02 08:00:00  Day
2022-01-02 20:00:00  Night
2022-01-03 08:00:00  Day
2022-01-03 20:00:00  Night
2022-01-04 08:00:00  Day
2022-01-04 20:00:00  Night
...and so on until the end of the year.
I can't figure out how to add repeated values to the table.
I want to add a 'shift_name' column that contains 'A', 'B', 'C', 'D' (it's like a name for each team).
What query should I use to achieve the following result:
shift_starts_dt      shift_type  shift_name
2022-01-01 08:00:00  Day         'A'
2022-01-01 20:00:00  Night       'B'
2022-01-02 08:00:00  Day         'C'
2022-01-02 20:00:00  Night       'D'
2022-01-03 08:00:00  Day         'A'
2022-01-03 20:00:00  Night       'B'
2022-01-04 08:00:00  Day         'C'
2022-01-04 20:00:00  Night       'D'
...
Use the number of half-days since Jan 1, modulo 4, to index an array:
select
  shift_starts_dt,
  shift_type,
  (array['A','B','C','D'])[(extract(epoch from shift_starts_dt - '2022-01-01')::int / 43200) % 4 + 1]
from work_schedule
You could replace '2022-01-01' with (select min(shift_starts_dt) from work_schedule) for a more general solution.
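To illustrate the indexing arithmetic outside SQL, a quick Python check (my own sketch, not part of the answer):

from datetime import datetime

teams = ['A', 'B', 'C', 'D']
origin = datetime(2022, 1, 1)

def shift_name(shift_starts_dt):
    # 43200 seconds = 12 hours = one half-day; every half-day advances one team
    half_days = int((shift_starts_dt - origin).total_seconds()) // 43200
    return teams[half_days % 4]

for ts in ['2022-01-01 08:00:00', '2022-01-01 20:00:00',
           '2022-01-02 08:00:00', '2022-01-02 20:00:00',
           '2022-01-03 08:00:00']:
    print(ts, shift_name(datetime.fromisoformat(ts)))
# prints A, B, C, D, A - matching the expected shift_name column

The SQL version uses % 4 + 1 rather than % 4 because Postgres arrays are 1-indexed.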
I'm new to databases (even more so to Postgres), so I hope you can help me. I have a table something like this:
id_interaction  start_time           end_time
0001            2022-06-03 12:40:10  2022-06-03 12:45:16
0002            2022-06-04 10:50:40  2022-06-04 11:10:12
0003            2022-06-04 16:30:00  2022-06-04 18:20:00
0004            2022-06-05 23:00:00  2022-06-06 10:30:12
Basically I need to create a query that gets the duration, broken down by hour, for example:
id_interaction  start_time           end_time             hour      duration
0001            2022-06-03 12:40:10  2022-06-03 12:45:16  12:00:00  00:05:06
0002            2022-06-04 10:50:40  2022-06-04 11:10:12  10:00:00  00:09:20
0002            2022-06-04 10:50:40  2022-06-04 11:10:12  11:00:00  00:10:12
0003            2022-06-04 16:30:00  2022-06-04 18:20:00  16:00:00  00:30:00
0003            2022-06-04 16:30:00  2022-06-04 18:20:00  17:00:00  01:00:00
0003            2022-06-04 16:30:00  2022-06-04 18:20:00  18:00:00  00:20:00
0004            2022-06-05 23:00:00  2022-06-06 03:30:12  23:00:00  01:00:00
0004            2022-06-05 23:00:00  2022-06-06 03:30:12  24:00:00  01:00:00
0004            2022-06-05 23:00:00  2022-06-06 03:30:12  01:00:00  01:00:00
0004            2022-06-05 23:00:00  2022-06-06 03:30:12  02:00:00  01:00:00
0004            2022-06-05 23:00:00  2022-06-06 03:30:12  03:00:00  00:30:12
I need all the hours from start to finish. For example, if an interaction starts at 17:10 and ends at 19:00, I need the duration within the 17:00, 18:00 and 19:00 hours.
If you're trying to get the duration in each whole hour interval overlapped by your data, this can be achieved by rounding timestamps using date_trunc(), using generate_series() to move around the intervals and casting between time, interval and timestamp:
create or replace function hours_crossed(starts timestamp, ends timestamp)
returns integer
language sql as '
  select case
    when date_trunc(''hour'', starts) = date_trunc(''hour'', ends)
      then 0
    when date_trunc(''hour'', starts) = starts
      then floor(extract(''epoch'' from ends - starts)::numeric / 60.0 / 60.0)
    else floor(extract(''epoch'' from ends - starts)::numeric / 60.0 / 60.0) + 1
  end';

select * from (
  select
    id_interacao,
    tempo_inicial,
    tempo_final,
    to_char(hora, 'HH24:00')::time as hora,
    least(tempo_final, hora + '1 hour'::interval)
      - greatest(tempo_inicial, hora) as duracao
  from (
    select
      *,
      date_trunc('hour', tempo_inicial)
        + (generate_series(0, hours_crossed(tempo_inicial, tempo_final))::text || ' hours')::interval
        as hora
    from test_times
  ) a
) a
where duracao <> '0'::interval;
This also fixes your first entry that lasts 5 minutes but shows as 45.
You'll need to decide how you want to handle zero-length intervals and ones that end on an exact hour; I added a condition to skip them.
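If it helps to see the idea in procedural form, here is a small Python sketch of the same hour-splitting logic (my own illustration, not part of the answer): walk hour by hour and clamp each hour window to the [start, end) interval.

from datetime import datetime, timedelta

def split_by_hour(start, end):
    rows = []
    hour = start.replace(minute=0, second=0, microsecond=0)  # like date_trunc('hour', start)
    while hour < end:
        overlap = min(end, hour + timedelta(hours=1)) - max(start, hour)
        if overlap > timedelta(0):  # skip zero-length pieces
            rows.append((hour.strftime('%H:00:00'), overlap))
        hour += timedelta(hours=1)
    return rows

for h, d in split_by_hour(datetime(2022, 6, 4, 16, 30), datetime(2022, 6, 4, 18, 20)):
    print(h, d)
# 16:00:00 0:30:00
# 17:00:00 1:00:00
# 18:00:00 0:20:00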
I have an input PySpark dataframe with columns ID, StartDatetime, EndDatetime. I want to add a new column named newdate based on StartDatetime and EndDatetime.
Input DF:
ID StartDatetime EndDatetime
1 21-06-2021 07:00 24-06-2021 16:00
2 21-06-2021 07:00 22-06-2021 16:00
Required output:
ID StartDatetime EndDatetime newdate
1 21-06-2021 07:00 24-06-2021 16:00 21-06-2021
1 21-06-2021 07:00 24-06-2021 16:00 22-06-2021
1 21-06-2021 07:00 24-06-2021 16:00 23-06-2021
1 21-06-2021 07:00 24-06-2021 16:00 24-06-2021
2 21-06-2021 07:00 22-06-2021 16:00 21-06-2021
2 21-06-2021 07:00 22-06-2021 16:00 22-06-2021
You can use explode and array_repeat to duplicate the rows.
I use a combination of row_number and date functions to get the date ranges between start and end dates:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window().partitionBy("ID").orderBy("StartDatetime")

output_df = (
    df
    # number of days covered by the range, inclusive of both endpoints
    .withColumn("diff", 1 + F.datediff(
        F.to_date(F.unix_timestamp("EndDatetime", "dd-MM-yyyy HH:mm").cast("timestamp")),
        F.to_date(F.unix_timestamp("StartDatetime", "dd-MM-yyyy HH:mm").cast("timestamp"))))
    # duplicate each row 'diff' times
    .withColumn("diff", F.expr("explode(array_repeat(diff, int(diff)))"))
    # re-number the duplicated rows 1..diff within each ID
    .withColumn("diff", F.row_number().over(w))
    .withColumn("start_dt", F.to_date(F.unix_timestamp("StartDatetime", "dd-MM-yyyy HH:mm").cast("timestamp")))
    # offset the start date by the row number to get each day in the range
    .withColumn("newdate", F.date_format(F.expr("date_add(start_dt, diff - 1)"), "dd-MM-yyyy"))
    .drop("diff", "start_dt")
)
Output:
output_df.orderBy("ID", "newdate").show()
+---+----------------+----------------+----------+
| ID| StartDatetime| EndDatetime| newdate|
+---+----------------+----------------+----------+
| 1|21-06-2021 07:00|24-06-2021 16:00|21-06-2021|
| 1|21-06-2021 07:00|24-06-2021 16:00|22-06-2021|
| 1|21-06-2021 07:00|24-06-2021 16:00|23-06-2021|
| 1|21-06-2021 07:00|24-06-2021 16:00|24-06-2021|
| 2|21-06-2021 07:00|22-06-2021 16:00|21-06-2021|
| 2|21-06-2021 07:00|22-06-2021 16:00|22-06-2021|
+---+----------------+----------------+----------+
I dropped the diff column, but displaying it will help you understand the logic if it's not clear.
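As a side note, if you are on Spark 2.4 or later, a shorter alternative (my own sketch, not the answer above) is to build the list of dates with sequence() and explode it:

from pyspark.sql import functions as F

alt_df = (
    df
    .withColumn('start_dt', F.to_date('StartDatetime', 'dd-MM-yyyy HH:mm'))
    .withColumn('end_dt', F.to_date('EndDatetime', 'dd-MM-yyyy HH:mm'))
    # sequence() generates every date from start_dt to end_dt inclusive
    .withColumn('newdate', F.explode(F.expr('sequence(start_dt, end_dt, interval 1 day)')))
    .withColumn('newdate', F.date_format('newdate', 'dd-MM-yyyy'))
    .drop('start_dt', 'end_dt')
)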
I have two tables. The first table has columns: id, start_date, and end_date. The second table has columns: id, timestamp, and value. Is there a way to make a sum of table 2 based on the conditions in table 1?
Table 1:
id  start_date           end_date
5   2000-01-01 01:00:00  2000-01-05 02:45:00
5   2000-01-10 01:00:00  2000-01-15 02:45:00
6   2000-01-01 01:00:00  2000-01-05 02:45:00
6   2000-01-11 01:00:00  2000-01-12 02:45:00
6   2000-01-15 01:00:00  2000-01-20 02:45:00
Table 2:
id  timestamp            value
5   2000-01-01 05:00:00  1
5   2000-01-01 06:00:00  2
6   2000-01-01 05:00:00  1
6   2000-01-11 05:00:00  2
6   2000-01-15 05:00:00  2
6   2000-01-15 05:30:00  2
Desired result:
id  start_date           end_date             Sum
5   2000-01-01 01:00:00  2000-01-05 02:45:00  3
5   2000-01-10 01:00:00  2000-01-15 02:45:00  null
6   2000-01-01 01:00:00  2000-01-05 02:45:00  1
6   2000-01-11 01:00:00  2000-01-12 02:45:00  2
6   2000-01-15 01:00:00  2000-01-20 02:45:00  4
Try this:
SELECT a.id, a.start_date, a.end_date, sum(b.value) AS sum
FROM table1 AS a
LEFT JOIN table2 AS b
ON b.id = a.id
AND b.timestamp >= a.start_date
AND b.timestamp < a.end_date
GROUP BY a.id, a.start_date, a.end_date
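If you happen to be working with PySpark dataframes rather than SQL tables, the same join-then-aggregate idea looks like this (my own sketch; table1_df and table2_df are assumed names):

from pyspark.sql import functions as F

result = (
    table1_df.alias('a')
    .join(
        table2_df.alias('b'),
        on=(F.col('b.id') == F.col('a.id'))
           & (F.col('b.timestamp') >= F.col('a.start_date'))
           & (F.col('b.timestamp') < F.col('a.end_date')),
        how='left',
    )
    .groupBy('a.id', 'a.start_date', 'a.end_date')
    .agg(F.sum('b.value').alias('sum'))
)

The left join keeps ranges with no matching rows, which is why the second range for id 5 comes back with a null sum.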