I have to resample the following cell array:
dateS =
'2004-09-02 06:00:00'
'2004-09-02 07:30:00'
'2004-09-02 12:00:00'
'2004-09-02 18:00:00'
'2004-09-02 19:30:00'
'2004-09-03 00:00:00'
'2004-09-03 05:30:00'
'2004-09-03 06:00:00'
The resampling should follow an irregular spacing: e.g. there are 5 readings between the 1st and 2nd rows, while there are 10 between the 2nd and 3rd. The number of intermediate 'readings' for each gap is stored in a vector 'v'. So what I need is a new vector with all the intermediate dates/times in the same format as dateS.
EDIT:
There are 1h30min = 90 min between the first 2 readings in the list. Five intervals between them amount to 90 min / 5 = 18 min each. Now insert five 'readings' between (1) and (2), each separated by 18 min. I need to do that for all of dateS.
Any ideas? Thanks!
You can interpolate the serial dates with interp1():
% Inputs
dates = [
'2004-09-02 06:00:00'
'2004-09-02 07:30:00'
'2004-09-02 12:00:00'
'2004-09-02 18:00:00'
'2004-09-02 19:30:00'
'2004-09-03 00:00:00'
'2004-09-03 05:30:00'
'2004-09-03 06:00:00'];
v = [5 4 3 2 4 5 3];
% Serial dates
serdates = datenum(dates,'yyyy-mm-dd HH:MM:SS');
% Interpolate
x = cumsum([1 v]);
resampled = interp1(x, serdates, x(1):x(end))';
The result:
datestr(resampled)
ans =
02-Sep-2004 06:00:00
02-Sep-2004 06:18:00
02-Sep-2004 06:36:00
02-Sep-2004 06:54:00
02-Sep-2004 07:12:00
02-Sep-2004 07:30:00
02-Sep-2004 08:37:30
02-Sep-2004 09:45:00
02-Sep-2004 10:52:30
02-Sep-2004 12:00:00
02-Sep-2004 14:00:00
02-Sep-2004 16:00:00
02-Sep-2004 18:00:00
02-Sep-2004 18:45:00
02-Sep-2004 19:30:00
02-Sep-2004 20:37:30
02-Sep-2004 21:45:00
02-Sep-2004 22:52:30
03-Sep-2004 00:00:00
03-Sep-2004 01:06:00
03-Sep-2004 02:12:00
03-Sep-2004 03:18:00
03-Sep-2004 04:24:00
03-Sep-2004 05:30:00
03-Sep-2004 05:40:00
03-Sep-2004 05:50:00
03-Sep-2004 06:00:00
The following code does what you want (I picked arbitrary values for v - as long as the number of elements in vector v is one less than the number of entries in dateS this should work):
dateS = [
'2004-09-02 06:00:00'
'2004-09-02 07:30:00'
'2004-09-02 12:00:00'
'2004-09-02 18:00:00'
'2004-09-02 19:30:00'
'2004-09-03 00:00:00'
'2004-09-03 05:30:00'
'2004-09-03 06:00:00'];
% "stations":
v = [6 5 4 3 5 6 4];
dn = datenum(dateS);
df = diff(dn)'./v;
newDates = [];
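for ii = 1:numel(v)
% (0:v(ii)-1) stops one step short of the interval's end point,
% because that end point is emitted as the start of the next interval
newDates = [newDates dn(ii) + (0:v(ii)-1)*df(ii)];
end
% append dn(end) here if the very last original date is needed as well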
newStrings = datestr(newDates, 'yyyy-mm-dd HH:MM:SS');
The array newStrings ends up containing the following. For example, you can see that the interval between the first and second times has been split into six 15-minute segments:
2004-09-02 06:00:00
2004-09-02 06:15:00
2004-09-02 06:30:00
2004-09-02 06:45:00
2004-09-02 07:00:00
2004-09-02 07:15:00
2004-09-02 07:30:00
2004-09-02 08:24:00
2004-09-02 09:18:00
2004-09-02 10:12:00
2004-09-02 11:06:00
2004-09-02 12:00:00
2004-09-02 13:30:00
2004-09-02 15:00:00
2004-09-02 16:30:00
2004-09-02 18:00:00
2004-09-02 18:30:00
2004-09-02 19:00:00
2004-09-02 19:30:00
2004-09-02 20:24:00
2004-09-02 21:18:00
2004-09-02 22:12:00
2004-09-02 23:06:00
2004-09-03 00:00:00
2004-09-03 00:55:00
2004-09-03 01:50:00
2004-09-03 02:45:00
2004-09-03 03:40:00
2004-09-03 04:35:00
2004-09-03 05:30:00
2004-09-03 05:37:30
2004-09-03 05:45:00
2004-09-03 05:52:30
The code relies on a few concepts:
A date can be represented as a string or as a datenum. I use the built-in functions datenum and datestr to convert between them.
Once you have the date/time as a number, it is easy to interpolate.
I use the diff function to find the difference between successive times.
I don't attempt to "vectorize" the code: you were not asking for efficient code, and for an example like this the clarity of a for loop trumps everything.
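For readers outside MATLAB, the same three ideas (parse to a time value, take the difference between neighbours, step through each gap) can be sketched in Python. This is only an illustration under stated assumptions: the function name resample is hypothetical, and appending the last original timestamp at the end is my choice rather than something the answer above does.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def resample(date_strings, v):
    """Split the gap between date i and date i+1 into v[i] equal steps."""
    if len(v) != len(date_strings) - 1:
        raise ValueError("v needs one entry per gap between consecutive dates")
    times = [datetime.strptime(s, FMT) for s in date_strings]
    out = []
    for start, end, n in zip(times, times[1:], v):
        step = (end - start) / n              # timedelta divided by an int
        # emit the start of the gap plus n-1 intermediate points;
        # the end of the gap is the start of the next one
        out.extend(start + k * step for k in range(n))
    out.append(times[-1])                     # assumption: keep the final date
    return [t.strftime(FMT) for t in out]

print(resample(["2004-09-02 06:00:00", "2004-09-02 07:30:00"], [5]))
```

With the first gap of 90 minutes and v = [5], the steps are 18 minutes apart, as in the question's edit.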
Related
Given a detailed calendar:
Sunday
hrs_per_day 0
Monday
07:00:00 11:59:00 5 hours
13:00:00 15:59:00 3 hours
hrs_per_day 8
Tuesday
07:00:00 11:59:00 5 hours
13:00:00 15:59:00 3 hours
hrs_per_day 8
Wednesday
07:00:00 11:59:00 5 hours
13:00:00 15:59:00 3 hours
hrs_per_day 8
Thursday
07:00:00 11:59:00 5 hours
13:00:00 15:59:00 3 hours
hrs_per_day 8
Friday
07:00:00 12:59:00 6 hours
hrs_per_day 6
Saturday
hrs_per_day 0
hrs_per_week 38
How can I compute the start and end dates of a task based on its duration?
Suppose I have a task that can start after Sunday 8 AM and will take 23 (8+8+7) hours of work.
Then the start date should be Monday 07:00:00, and the end date should be Wednesday 15:00:00.
I can work out the dates manually, but I am not sure how to implement this in a program:
function get_start_end_dates(can_start_after, duration_hrs, calendar_data){
// ??????????
return {start_date, end_date}
}
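One possible way to fill in such a function is to walk forward day by day and consume the working windows until the duration is used up. The sketch below is in Python rather than JavaScript and rests on assumptions that are not in the question: the CALENDAR layout is hypothetical, the listed end times 11:59/15:59/12:59 are read as "up to the full hour", and the task may begin exactly at can_start_after.

```python
from datetime import datetime, time, timedelta

# Working windows per weekday (Monday=0 ... Sunday=6), from the calendar above.
# Assumption: 11:59, 15:59 and 12:59 are rounded up to 12:00, 16:00 and 13:00.
FULL_DAY = [(time(7), time(12)), (time(13), time(16))]
CALENDAR = {0: FULL_DAY, 1: FULL_DAY, 2: FULL_DAY, 3: FULL_DAY,
            4: [(time(7), time(13))]}

def get_start_end_dates(can_start_after, duration_hrs, calendar=CALENDAR):
    """Return (start, end) datetimes for a task needing duration_hrs working hours.

    Assumes every window has start < end and windows are listed in time order.
    """
    if duration_hrs <= 0 or not any(calendar.values()):
        raise ValueError("need a positive duration and at least one working window")
    remaining = timedelta(hours=duration_hrs)
    start = None
    day = can_start_after.date()
    while True:
        for win_start, win_end in calendar.get(day.weekday(), []):
            begin = max(datetime.combine(day, win_start), can_start_after)
            finish = datetime.combine(day, win_end)
            if begin >= finish:
                continue                      # window is over before we may start
            if start is None:
                start = begin                 # first usable moment
            worked = min(finish - begin, remaining)
            remaining -= worked
            if remaining <= timedelta(0):
                return start, begin + worked
        day += timedelta(days=1)

# 2024-01-07 is a Sunday; 23 hours of work after Sunday 08:00
print(get_start_end_dates(datetime(2024, 1, 7, 8, 0), 23))
```

For the example in the question this yields Monday 07:00 as the start and Wednesday 15:00 as the end.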
I have the following PySpark dataframe:
year  week  date      time         value
2020  1     20201203  2:00 - 2:15  23.9
2020  1     20201203  2:15 - 2:30  45.87
2020  1     20201203  2:30 - 2:45  87.76
2020  1     20201203  2:45 - 3:00  12.87
I want to pivot the time and value columns, so that each time slot becomes its own column. The desired output should be:
year  week  date      2:00 - 2:15  2:15 - 2:30  2:30 - 2:45  2:45 - 3:00
2020  1     20201203  23.9         45.87        87.76        12.87
You can use groupby and pivot.
df = df.groupby('year', 'week', 'date').pivot('time').max('value')
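To make clear what the groupby plus pivot combination computes, here is a plain-Python sketch of the same reshaping on the sample rows. It is not Spark code; the function name pivot and the tuple layout are assumptions made for illustration only.

```python
from collections import defaultdict

rows = [
    (2020, 1, 20201203, "2:00 - 2:15", 23.9),
    (2020, 1, 20201203, "2:15 - 2:30", 45.87),
    (2020, 1, 20201203, "2:30 - 2:45", 87.76),
    (2020, 1, 20201203, "2:45 - 3:00", 12.87),
]

def pivot(rows):
    """Group by (year, week, date) and spread each time slot into its own column."""
    wide = defaultdict(dict)
    for year, week, date, slot, value in rows:
        cols = wide[(year, week, date)]
        # keep the largest value per slot, mirroring .max('value')
        cols[slot] = max(value, cols.get(slot, value))
    return dict(wide)

print(pivot(rows))
```

Because every (group, time slot) pair holds a single value here, the aggregate only serves to pick that one value.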
I have an input PySpark dataframe with the columns ID, StartDatetime and EndDatetime. I want to add a new column named newdate that holds one row per day between StartDatetime and EndDatetime.
Input DF:
ID  StartDatetime     EndDatetime
1   21-06-2021 07:00  24-06-2021 16:00
2   21-06-2021 07:00  22-06-2021 16:00
Required output:
ID  StartDatetime     EndDatetime       newdate
1   21-06-2021 07:00  24-06-2021 16:00  21-06-2021
1   21-06-2021 07:00  24-06-2021 16:00  22-06-2021
1   21-06-2021 07:00  24-06-2021 16:00  23-06-2021
1   21-06-2021 07:00  24-06-2021 16:00  24-06-2021
2   21-06-2021 07:00  22-06-2021 16:00  21-06-2021
2   21-06-2021 07:00  22-06-2021 16:00  22-06-2021
You can use explode and array_repeat to duplicate the rows.
I use a combination of row_number and date functions to get the date ranges between start and end dates:
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window().partitionBy("ID").orderBy('StartDatetime')
output_df = df.withColumn("diff", 1+F.datediff(F.to_date(F.unix_timestamp('EndDatetime', 'dd-MM-yyyy HH:mm').cast('timestamp')), \
F.to_date(F.unix_timestamp('StartDatetime', 'dd-MM-yyyy HH:mm').cast('timestamp'))))\
.withColumn('diff', F.expr('explode(array_repeat(diff,int(diff)))'))\
.withColumn("diff", F.row_number().over(w))\
.withColumn("start_dt", F.to_date(F.unix_timestamp('StartDatetime', 'dd-MM-yyyy HH:mm').cast('timestamp')))\
.withColumn("newdate", F.date_format(F.expr("date_add(start_dt, diff-1)"), 'dd-MM-yyyy')).drop('diff', 'start_dt')
Output:
output_df.orderBy("ID", "newdate").show()
+---+----------------+----------------+----------+
| ID| StartDatetime| EndDatetime| newdate|
+---+----------------+----------------+----------+
| 1|21-06-2021 07:00|24-06-2021 16:00|21-06-2021|
| 1|21-06-2021 07:00|24-06-2021 16:00|22-06-2021|
| 1|21-06-2021 07:00|24-06-2021 16:00|23-06-2021|
| 1|21-06-2021 07:00|24-06-2021 16:00|24-06-2021|
| 2|21-06-2021 07:00|22-06-2021 16:00|21-06-2021|
| 2|21-06-2021 07:00|22-06-2021 16:00|22-06-2021|
+---+----------------+----------------+----------+
I dropped the diff column, but displaying it will help you understand the logic if it's not clear.
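If the chain of column expressions is hard to follow, the underlying logic can be sketched in plain Python: count the days from start to end inclusive, repeat the row that many times, and add a running offset to the start date. This is an illustration only, not Spark code; the function name expand_dates and the tuple layout are assumptions.

```python
from datetime import datetime, timedelta

FMT = "%d-%m-%Y %H:%M"

def expand_dates(rows):
    """rows: iterable of (id, start_str, end_str); returns one tuple per calendar day."""
    out = []
    for id_, start_s, end_s in rows:
        start = datetime.strptime(start_s, FMT).date()
        end = datetime.strptime(end_s, FMT).date()
        n_days = (end - start).days + 1        # same role as 1 + datediff(end, start)
        for offset in range(n_days):           # offset plays the role of row_number - 1
            newdate = start + timedelta(days=offset)
            out.append((id_, start_s, end_s, newdate.strftime("%d-%m-%Y")))
    return out

for row in expand_dates([
    (1, "21-06-2021 07:00", "24-06-2021 16:00"),
    (2, "21-06-2021 07:00", "22-06-2021 16:00"),
]):
    print(row)
```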
I am trying to use a CASE WHEN statement like below to add 1 day to a timestamp based on the time part of the timestamp:
CASE WHEN to_char(pickup_date, 'HH24:MI') between 0 and 7 then y.pickup_date else dateadd(day,1,y.pickup_date) end as ead_target
pickup_Date is a timestamp with default format YYYY-MM-DD HH:MM:SS
My output
pickup_Date            ead_target
2020-07-01 10:00:00    2020-07-01 10:00:00
2020-07-02 3:00:00     2020-07-02 3:00:00
When the hour of the day is between 0 and 7 then ead_target = pickup_Date ELSE add 1 day
Expected output
pickup_Date            ead_target
2020-07-01 10:00:00    2020-07-02 10:00:00
2020-07-02 3:00:00     2020-07-02 3:00:00
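You will want to use the date_part() function to extract the hour of the day: https://docs.aws.amazon.com/redshift/latest/dg/r_DATE_PART_function.html
to_char(pickup_date, 'HH24:MI') returns a string such as '10:00', not a number, so comparing it with 0 and 7 does not test the hour. Your CASE statement should work if you extract 'hour' from the timestamp, e.g. date_part(hour, pickup_date), and compare that to the range 0 - 7.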
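To check the expected values independently of the database, the rule itself (keep the timestamp when the hour is between 0 and 7, otherwise add one day) can be sketched in Python. The function name ead_target simply reuses the column alias from the question; this is not Redshift code.

```python
from datetime import datetime, timedelta

def ead_target(pickup_date):
    """Keep timestamps whose hour is 0..7; push all others to the next day."""
    if 0 <= pickup_date.hour <= 7:        # compare the numeric hour, not a string
        return pickup_date
    return pickup_date + timedelta(days=1)

for ts in (datetime(2020, 7, 1, 10, 0), datetime(2020, 7, 2, 3, 0)):
    print(ts, "->", ead_target(ts))
```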
I have a delivery_slots table that has a from column (datetime).
Delivery slots are stored as 1 hour to 1 hour and 30 minute intervals, daily.
i.e. 3.00am-4.30am, 6.00am-7.30am, 9.00am-10.30am and so forth
id | from
------+---------------------
1 | 2016-01-01 03:00:00
2 | 2016-01-01 04:30:00
3 | 2016-01-01 06:00:00
4 | 2016-01-01 07:30:00
5 | 2016-01-01 09:00:00
6 | 2016-01-01 10:30:00
7 | 2016-01-01 12:00:00
8 | 2016-01-02 03:00:00
9 | 2016-01-02 04:30:00
10 | 2016-01-02 06:00:00
11 | 2016-01-02 07:30:00
12 | 2016-01-02 09:00:00
13 | 2016-01-02 10:30:00
14 | 2016-01-02 12:00:00
I'm trying to get all delivery_slots between the hours of 3.00am and 4.30am. I've got the following so far:
SELECT * FROM delivery_slots WHERE EXTRACT(HOUR FROM delivery_slots.from) >= 3 AND EXTRACT(MINUTE FROM delivery_slots.from) >= 0 AND EXTRACT(HOUR FROM delivery_slots.from) <= 4 AND EXTRACT(MINUTE FROM delivery_slots.from) <= 30;
Which kinda works. Kinda, because it is only returning delivery slots that have minutes of 00.
That's because of the last where condition (EXTRACT(MINUTE FROM delivery_slots.from) <= 30).
To give you an idea of the result I expect:
id | from
-------+---------------------
1 | 2016-01-01 03:00:00
2 | 2016-01-01 04:30:00
8 | 2016-01-02 03:00:00
9 | 2016-01-02 04:30:00
15 | 2016-01-03 03:00:00
16 | 2016-01-03 04:30:00
etc...
Is there a better way to go about this?
Try this: (not tested)
SELECT * FROM delivery_slots WHERE delivery_slots.from::time >= '03:00:00' AND delivery_slots.from::time <= '04:30:00'
Hope this helps.
Cheers.
The easiest way to do this, in my mind, is to cast the from column to type time and filter it with >= and <=, like so:
select * from testing where (date::time >= '3:00'::time and date::time <= '4:30'::time);
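The reason comparing the whole time of day is safer than comparing hour and minute separately can be shown with a small Python sketch. Both function names are hypothetical; hour_minute_filter is a literal translation of the WHERE clause from the question, and in_window mirrors the cast-to-time comparison.

```python
from datetime import datetime, time

def in_window(ts, start=time(3, 0), end=time(4, 30)):
    """Compare the time of day as a single value, like from::time in SQL."""
    return start <= ts.time() <= end

def hour_minute_filter(ts):
    """The question's conditions: hour and minute are tested independently."""
    return ts.hour >= 3 and ts.minute >= 0 and ts.hour <= 4 and ts.minute <= 30

probe = datetime(2016, 1, 1, 3, 45)
print(in_window(probe))           # 03:45 lies inside 03:00 to 04:30
print(hour_minute_filter(probe))  # rejected, because minute 45 > 30
```

A slot starting at 03:45 lies inside the window but is dropped by the per-field test, which is why the single time comparison is the more robust filter.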