Bin dates without aggregation in Pandas - date

I have a date column in a pandas DataFrame:
date Value
2014-02-27 0
2014-08-15 1
2015-04-11 1
2014-09-01 2
I need a function to create a new column that identifies what quarter, month, etc. a record belongs to as shown below:
get_date_bucket(date_var='date', frequency='Q')
date Value date_bucket
2014-02-27 0 2014-03-31
2014-08-15 1 2014-09-30
2015-04-11 1 2015-06-30
2014-09-01 2 2014-09-30
or...
get_date_bucket(date_var='date', frequency='M')
date Value date_bucket
2014-02-27 0 2014-02-28
2014-08-15 1 2014-08-31
2015-04-11 1 2015-04-30
2014-09-01 2 2014-09-30
The data is reasonably large and I don't want to do any aggregation if I can avoid it. What is the simplest way to create the 'date_bucket' column from the 'date' column on the left?

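The easiest way is to use a pandas offset. Passing 0 rolls each date forward to the quarter end only if it is not already on one:
import pandas as pd
df['date_bucket'] = df.date + pd.offsets.QuarterEnd(0)
df
To generalize to any frequency specified by a string (recent pandas versions spell the quarter-end alias 'QE' rather than 'Q'):
from pandas.tseries.frequencies import to_offset
df.date + to_offset('Q')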
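A minimal sketch of the get_date_bucket function asked for in the question, under a few assumptions: the function takes the DataFrame as an extra first argument, and it goes through pandas periods (to_period and end_time) rather than offsets, so a date that already falls on a period end stays where it is. The sample frame is copied from the question.

```python
import pandas as pd


def get_date_bucket(df, date_var="date", frequency="Q"):
    """Return the last day of the period (quarter, month, ...) containing each date."""
    dates = pd.to_datetime(df[date_var])
    # to_period maps each date to its period; end_time is the period's last
    # instant, and normalize() drops the time-of-day part.
    return dates.dt.to_period(frequency).dt.end_time.dt.normalize()


df = pd.DataFrame(
    {
        "date": ["2014-02-27", "2014-08-15", "2015-04-11", "2014-09-01"],
        "Value": [0, 1, 1, 2],
    }
)
df["date_bucket"] = get_date_bucket(df, date_var="date", frequency="Q")

quarter_ends = df["date_bucket"].dt.strftime("%Y-%m-%d").tolist()
month_ends = get_date_bucket(df, frequency="M").dt.strftime("%Y-%m-%d").tolist()
print(df)
```

No grouping or aggregation is involved: every row keeps its own value and only gains a bucket label.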

Related

Epoch time difference when dates vary not showing correct value

SELECT
r.order_id
,c.order_time
,r.pickup_time
,EXTRACT(epoch FROM (r.pickup_time - c.order_time) / 60) AS diff
FROM runner_orders1 r
JOIN customer_orders1 c ON c.order_id = r.order_id
WHERE distance != 0;
order_id | order_time | pickup_time | diff
1 | 2020-01-01 18:05:02 | 2020-01-01 18:15:34 | 10.533333
2 | 2020-01-01 19:00:52 | 2020-01-01 19:10:54 | 10.033333
3 | 2020-01-02 23:51:23 | 2020-01-03 00:12:37 | 21.233333
3 | 2020-01-02 23:51:23 | 2020-01-03 00:12:37 | 21.233333
4 | 2020-01-04 13:23:46 | 2020-01-04 13:53:03 | 29.283333
4 | 2020-01-04 13:23:46 | 2020-01-04 13:53:03 | 29.283333
Above is my SQL output. When the order and pickup timestamps fall on different days, the value looks incorrect. Please check and help.
3 | 2020-01-02 23:51:23 | 2020-01-03 00:12:37 | 21.233333
Your query is returning exactly the correct results (at least to the 6 decimal digits shown). The problem seems to stem from your expectations. Apparently you expect diff to be in minutes and seconds, but that is not what you are getting. By extracting epoch and dividing by 60, your result is in minutes plus a fractional part of a minute. For the selected row, the difference between pickup_time and order_time is 21:14 (min:sec); the value returned, 21.233333, works out to 21:13.99998 (min:sec.fractional seconds). This is just another reason why using epoch is not really a good practice (see Epoch Mania for others).
Correct the result by just subtracting the timestamps (you are already doing so). This gives the result as an interval in hours:minutes:seconds. For your example it returns 00:21:14 (see demo).
select r.order_id
, c.order_time
, r.pickup_time
, r.pickup_time - c.order_time as diff
from runner_orders1 r
join customer_orders1 c on c.order_id = r.order_id
where distance != 0;
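The arithmetic behind the two representations can be checked outside the database. This is a small Python sketch for the row in question, with the timestamps copied from the output above; the variable names are only illustrative.

```python
from datetime import datetime

order_time = datetime(2020, 1, 2, 23, 51, 23)
pickup_time = datetime(2020, 1, 3, 0, 12, 37)

diff = pickup_time - order_time        # a timedelta, comparable to a PostgreSQL interval
total_seconds = diff.total_seconds()   # the quantity EXTRACT(epoch FROM ...) reports
decimal_minutes = total_seconds / 60   # the value shown in the diff column

# Split the same duration into whole minutes and remaining seconds.
minutes, seconds = divmod(int(total_seconds), 60)

print(round(decimal_minutes, 6))   # decimal minutes
print(f"{minutes}:{seconds:02d}")  # minutes:seconds
```

Both printed values describe the same duration; only the unit of the fractional part differs.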

PostgreSQL: How subtract dates in two consecutive rows which are in two different columns?

I need to subtract two timestamps that sit in different rows and different columns, start_time(n+1) - end_time(n), partitioned by id. Look at the table below:
row |id| Start_time | end_time
----+--+---------------------+------------------------
1 |14|"2012-06-16 21:43:00"|"2012-06-16 23:54:00"
2 |14|"2012-06-19 13:09:00"|"2012-06-19 23:59:00"
3 |21|"2016-04-12 14:46:00"|"2016-04-25 13:55:00"
4 |21|"2016-04-20 09:35:00"|"2016-04-20 22:00:00"
5 |24|"2011-09-19 19:20:00"|"2011-09-19 22:15:00"
6 |24|"2011-09-20 19:01:00"|"2011-09-22 14:05:00"
7 |30|"2009-10-21 07:25:00"|"2009-10-21 10:59:00"
8 |30|"2009-10-27 16:01:00"|"2009-11-10 16:00:00"
9 |30|"2009-10-28 08:13:00"|"2009-10-28 23:59:00"
10 |36|"2015-11-23 12:21:00"|"2015-11-23 15:19:00"
Example: for id=14, I need to subtract end_time in row 1 from start_time in row 2 and store the result in a new column. My main purpose is to find the time difference between consecutive measurements for every id.
Is it possible at all?
My preferred result would look like the diff column below:
row |id| Start_time | end_time | diff
----+--+---------------------+---------------------+------------
1 |14|"2012-06-16 21:43:00"|"2012-06-16 23:54:00"|
2 |14|"2012-06-19 13:09:00"|"2012-06-19 23:59:00"|"2 days 13:15:00"
3 |21|"2016-04-12 14:46:00"|"2016-04-25 13:55:00"|
4 |21|"2016-04-20 09:35:00"|"2016-04-20 22:00:00"|"-5 days -04:20:00"
5 |24|"2011-09-19 19:20:00"|"2011-09-19 22:15:00"|
6 |24|"2011-09-20 19:01:00"|"2011-09-22 14:05:00"|"20:46:00"
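In PostgreSQL the usual tool for this kind of row-to-previous-row comparison is the lag() window function partitioned by id. The logic itself can be sketched in plain Python; the function name and the ordering by start_time within each id are assumptions made for illustration, and the rows are the first six from the question.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

rows = [  # (row, id, start_time, end_time), copied from the question
    (1, 14, "2012-06-16 21:43:00", "2012-06-16 23:54:00"),
    (2, 14, "2012-06-19 13:09:00", "2012-06-19 23:59:00"),
    (3, 21, "2016-04-12 14:46:00", "2016-04-25 13:55:00"),
    (4, 21, "2016-04-20 09:35:00", "2016-04-20 22:00:00"),
    (5, 24, "2011-09-19 19:20:00", "2011-09-19 22:15:00"),
    (6, 24, "2011-09-20 19:01:00", "2011-09-22 14:05:00"),
]


def diffs_to_previous_end(rows):
    """Map each row number to start_time minus the previous end_time of the same id."""
    previous_end = {}  # id -> end_time of the row seen just before
    result = {}
    # Assumption: within an id, rows are ordered by start_time.
    for row, id_, start, end in sorted(rows, key=lambda r: (r[1], r[2])):
        start_dt = datetime.strptime(start, FMT)
        result[row] = start_dt - previous_end[id_] if id_ in previous_end else None
        previous_end[id_] = datetime.strptime(end, FMT)
    return result


diffs = diffs_to_previous_end(rows)
for row, diff in diffs.items():
    print(row, diff)
```

The first row of each id has no predecessor, so its diff stays empty, which matches the preferred result above.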

How to add blank records for grouping based on formula in Crystal Reports

I have one table and group its records using a formula based on a string field formatted as a time (HH:mm:ss).
The formula is as follows:
select Minute (TimeValue ({MASTER.Saat}))
case 0 to 14: ReplicateString ("0", 2-len(TOTEXT(Hour (TimeValue ({MASTER.Saat})),0))) & TOTEXT(Hour (TimeValue ({MASTER.Saat})),0) & ":00:00"
case 15 to 29: ReplicateString ("0", 2-len(TOTEXT(Hour (TimeValue ({MASTER.Saat})),0))) & TOTEXT(Hour (TimeValue ({MASTER.Saat})),0) & ":15:00"
case 30 to 44: ReplicateString ("0", 2-len(TOTEXT(Hour (TimeValue ({MASTER.Saat})),0))) & TOTEXT(Hour (TimeValue ({MASTER.Saat})),0) & ":30:00"
case 45 to 59: ReplicateString ("0", 2-len(TOTEXT(Hour (TimeValue ({MASTER.Saat})),0))) & TOTEXT(Hour (TimeValue ({MASTER.Saat})),0) & ":45:00"
Grouping works fine, but if there is no data in the table for a period, I cannot show that period in the report.
For example, suppose my data has 5 records as follows:
11:01:03
11:16:07
11:28:16
12:18:47
12:22:34
My report gives the following result:
Period | Total Records
11:00:00 | 1
11:15:00 | 2
12:15:00 | 2
In this situation, I cannot show the periods that are missing from the table with 0 for Total Records. I need the report to look like this:
Period | Total Records
11:00:00 | 1
11:15:00 | 2
11:30:00 | 0
11:45:00 | 0
12:00:00 | 0
12:15:00 | 2
Thanks for all suggestions.
You can't group something that's not there. One way to solve this is to use a table that provides all of the intervals you want to look at (often called a date, time or number table).
For your case, create a table that contains all your period values (24x4) and join the records you want to count to this table. In Crystal Reports, group by the period values. Your result set will then contain the periods without any joined records; you can detect these and output a 0.
You may want to look at this question, it is similar to yours.
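The period-table idea can be sketched outside Crystal Reports. The Python below is only an illustration of the join logic: period_of is a hypothetical helper that mirrors the grouping formula, and restricting the output to the span between the first and last populated period is an assumption made to match the desired report.

```python
from collections import Counter


def period_of(time_text):
    """Round an HH:mm:ss string down to the start of its 15-minute period."""
    hour, minute, _ = (int(part) for part in time_text.split(":"))
    return f"{hour:02d}:{minute // 15 * 15:02d}:00"


# The "period table": all 24 x 4 quarter-hour periods of a day.
all_periods = [f"{h:02d}:{m:02d}:00" for h in range(24) for m in (0, 15, 30, 45)]

records = ["11:01:03", "11:16:07", "11:28:16", "12:18:47", "12:22:34"]
counts = Counter(period_of(t) for t in records)

# Left join from the period table to the counts: periods without records get 0.
# Assumption: only the span between the first and last populated period is shown.
first, last = min(counts), max(counts)
report = [(p, counts.get(p, 0)) for p in all_periods if first <= p <= last]

print("Period | Total Records")
for period, total in report:
    print(f"{period} | {total}")
```

Because the rows come from the period table rather than from the data, empty periods appear automatically.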

How to get recent value in Spark dataframe

I have a Spark data frame with following structure
id flag price date
a 0 100 2015
a 0 50 2015
a 1 200 2014
a 1 300 2013
a 0 400 2012
I need to create a data frame in which each flag 0 row is updated with the most recent price from an earlier flag 1 row.
id flag price date new_column
a 0 100 2015 200
a 0 50 2015 200
a 1 200 2014 null
a 1 300 2013 null
a 0 400 2012 null
Consider the first row (flag=0): there are 2 candidate flag 1 values (200 and 300), and I take the most recent one, 200 (2014). For the last row there is no earlier flag 1 value, so it is set to null.
I am looking for a solution using Scala. Any help would be appreciated. Thanks.
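You can try to use window functions.
Basically, create a window that partitions by id, orders by date, and covers every row with an earlier date. Then take the last flag 1 price in that window for every line (a plain lag of one row is not enough when several flag 0 rows sit next to each other, as the two 2015 rows do). Lastly, wrap it in when so that rows with flag 1 stay null.
Something like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last, when}
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark, as in spark-shell
val df = Seq(("a",0,100,2015),("a",0,50,2015),("a",1,200,2014),("a",1,300,2013),("a",0,400,2012)).toDF("id","flag","price","date")
// frame: all rows of the same id whose date is at least 1 less than the current row's date
val wSpec1 = Window.partitionBy("id").orderBy("date").rangeBetween(Window.unboundedPreceding, -1)
// most recent flag 1 price in the frame; ignoreNulls = true skips the flag 0 rows
val latestFlag1 = last(when(col("flag") === 1, col("price")), ignoreNulls = true).over(wSpec1)
val df2 = df.withColumn("new_column", when(col("flag") === 0, latestFlag1))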
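The expected new_column from the question can be checked without a cluster. The following is a plain-Python sketch of the rule as stated there (for a flag 0 row, take the price of the most recent flag 1 row with a strictly earlier date and the same id); it is not Spark code, and the function name is made up for illustration.

```python
rows = [  # (id, flag, price, date), copied from the question
    ("a", 0, 100, 2015),
    ("a", 0, 50, 2015),
    ("a", 1, 200, 2014),
    ("a", 1, 300, 2013),
    ("a", 0, 400, 2012),
]


def latest_flag1_price(rows):
    """Append to each row the most recent earlier flag 1 price, or None."""
    result = []
    for id_, flag, price, date in rows:
        new_value = None
        if flag == 0:
            # flag 1 rows of the same id with a strictly earlier date
            earlier = [(d, p) for i, f, p, d in rows if i == id_ and f == 1 and d < date]
            if earlier:
                new_value = max(earlier)[1]  # price belonging to the latest date
        result.append((id_, flag, price, date, new_value))
    return result


new_column = [row[-1] for row in latest_flag1_price(rows)]
print(new_column)
```

This quadratic loop is only meant for verifying small samples; on real data the window function does the same work per partition.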

TIBCO Spotfire Analyst v7.10: Conditionally Create Two New Calculated Columns Based on Comparison of Two Date Columns

I currently have the sample table below in Spotfire Analyst 7.10. I want to create two new calculated columns using the data below (Insert --> Calculated Column).
Id Date cDate Value
--------------------------------------------
A 10/17/2017 10/18/2017 10
A 10/17/2017 10/14/2017 15
A 10/17/2017 10/8/2017 -2
B 11/19/2017 11/19/2017 4
B 11/19/2017 11/30/2017 3
Below is the logic I'm trying to implement:
a. for the SAME Id value, find the cDate that is equal to OR, if not equal, then closest to the date in Date column
b. the cDate selected CANNOT be greater than the Date value
c. create new column (Rel Date) that has the date fulfilling criteria "a" and "b" listed for each row associated with same Id
d. create another new column (Value) that pulls in Value associated with date selected in criteria "c"
Below is the output table I want after above logic is implemented:
Id Date cDate Rel Date Value
----------------------------------------------------------
A 10/17/2017 10/18/2017 10/14/2017 15
A 10/17/2017 10/14/2017 10/14/2017 15
A 10/17/2017 10/8/2017 10/14/2017 15
B 11/19/2017 11/19/2017 11/19/2017 4
B 11/19/2017 11/30/2017 11/19/2017 4
@PineNuts0, here's how you could achieve this.
Step 1: Add a calculated column [diff] with the expression below which finds the difference between [Date] and [cDate] over column [Id].
If(Days([cDate] - Max([Date]) over (Intersect([Id],[Date])))>0,null,Days([cDate] - Max([Date]) over (Intersect([Id],[Date]))))
Step 2: Add another calculated column [MAX_diff] with the expression below which finds the max of [diff] column over [Id] and [Date].
Max([diff]) over (Intersect([Id],[Date]))
Step 3: Now, add a calculated column [GET_VAL] to get the value based on max difference between [Date] and [cDate].
Max([Value]) over (Intersect([MAX_diff],[Id]))
Step 4: Finally, create a calculated column [Rel Date] to get [cDate] based on the values we got in the previous step.
DateAdd("dd",max([diff]) over (Intersect([Id],[Date])),[Date])
Note: The calculated columns which were created in Step 1 and 2 can run in the background and it is not required to display them in the table.
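As a cross-check of criteria a to d from the question, the same selection can be sketched in Python. This is not Spotfire expression syntax; the helper names are hypothetical and the rows are copied from the sample table.

```python
from datetime import datetime


def parse(text):
    """Parse a M/D/YYYY string into a date."""
    return datetime.strptime(text, "%m/%d/%Y").date()


rows = [  # (Id, Date, cDate, Value), copied from the question
    ("A", "10/17/2017", "10/18/2017", 10),
    ("A", "10/17/2017", "10/14/2017", 15),
    ("A", "10/17/2017", "10/8/2017", -2),
    ("B", "11/19/2017", "11/19/2017", 4),
    ("B", "11/19/2017", "11/30/2017", 3),
]


def closest_not_after(rows):
    """For each Id, pick the latest cDate that is not after Date, with its Value."""
    best = {}  # Id -> (Rel Date, Value)
    for id_, date, c_date, value in rows:
        d, c = parse(date), parse(c_date)
        # criterion b: cDate may not exceed Date; criterion a: keep the closest one
        if c <= d and (id_ not in best or c > best[id_][0]):
            best[id_] = (c, value)
    return best


best = closest_not_after(rows)
for id_, date, c_date, value in rows:
    rel_date, rel_value = best[id_]
    print(id_, date, c_date, rel_date.strftime("%m/%d/%Y"), rel_value)
```

Every row of an Id receives the same Rel Date and Value, as in the desired output table.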