Pandas: Combine resampling with groupby and calculate time differences

I am doing data analysis on trading data. I would like to use Pandas to examine the times when the traders are active.
In particular, I am trying to extract the difference in minutes between the timestamps of every trader's first trade of each day, and to cumulate these differences to a monthly basis.
The data looks like this:
Timestamp (Datetime) | Buyer | Volume
--------------------------------------
2012-01-01 09:00:00 | John | 10
2012-01-01 10:00:00 | Mark | 10
2012-01-01 16:00:00 | Mark | 10
2012-01-01 11:00:00 | Kevin | 10
2012-02-01 10:00:00 | Mark | 10
2012-02-01 09:00:00 | John | 10
2012-02-01 17:00:00 | Mark | 10
Right now I use resampling to retrieve the first trade on a daily basis (a sketch of this step is shown below). However, I also want to group by the buyer in order to calculate the differences in their trading times, like this:
Timestamp (Datetime) | Buyer | Volume
--------------------------------------
2012-01-01 09:00:00 | John | 10
2012-01-01 10:00:00 | Mark | 10
2012-01-01 11:00:00 | Kevin | 10
2012-01-02 10:00:00 | Mark | 10
2012-01-02 09:00:00 | John | 10
Overall I am looking to calculate the differences in minutes between the first trades on a daily basis for each trader.
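For reference, the resampling step I currently use looks roughly like this (a sketch of the idea only; the names df and first_daily are placeholders and my actual code differs slightly):
import pandas as pd

# current step: first trade per day, ignoring the buyer
first_daily = df.set_index('Timestamp').resample('D').first().dropna()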
Update
For example, in the case of John on 2012-01-01: Dist = 60 (diff John-Mark) + 120 (diff John-Kevin) = 180 minutes.
I would highly appreciate it if anyone has an idea how to do this.
Thank you

Your original frame (the resampled one)
In [71]: df_orig
Out[71]:
buyer date volume
0 John 2012-01-01 09:00:00 10
1 Mark 2012-01-01 10:00:00 10
2 Kevin 2012-01-01 11:00:00 10
3 Mark 2012-01-02 10:00:00 10
4 John 2012-01-02 09:00:00 10
Set the index to the date column, keeping the date column in place
In [75]: df = df_orig.set_index('date',drop=False)
Create this aggregation function
def f(frame):
    # sort the day's trades by time so that iloc[0] is the first trade
    frame.sort('date', inplace=True)   # pandas < 0.17; later versions use sort_values
    # broadcast the time of the day's first trade to every row
    frame['start'] = frame.date.iloc[0]
    return frame
Group by the single day
In [74]: x = df.groupby(pd.TimeGrouper('1d')).apply(f)
Create the differential in minutes
In [86]: x['diff'] = (x.date-x.start).apply(lambda td: float(td.item().total_seconds())/60)
In [87]: x
Out[87]:
buyer date volume start diff
date
2012-01-01 2012-01-01 09:00:00 John 2012-01-01 09:00:00 10 2012-01-01 09:00:00 0
2012-01-01 10:00:00 Mark 2012-01-01 10:00:00 10 2012-01-01 09:00:00 60
2012-01-01 11:00:00 Kevin 2012-01-01 11:00:00 10 2012-01-01 09:00:00 120
2012-01-02 2012-01-02 09:00:00 John 2012-01-02 09:00:00 10 2012-01-02 09:00:00 0
2012-01-02 10:00:00 Mark 2012-01-02 10:00:00 10 2012-01-02 09:00:00 60
Here's the explanation. We use the TimeGrouper to do the grouping by date, where each day's sub-frame is passed to the function f. This function then records the first date of the day (the sort is necessary here). Subtracting this from the date of each entry gives a timedelta64, which is then massaged into minutes (this is a bit hacky right now because of some numpy issues; it should be more natural in 0.12).
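For reference, on current pandas releases the same computation can be written without the era-specific workarounds. This is a sketch, assuming the same df_orig frame as above (first_of_day is a placeholder name); pd.TimeGrouper has since become pd.Grouper, and DataFrame.sort has become sort_values:
import pandas as pd

df = df_orig.set_index('date', drop=False)

def first_of_day(frame):
    # guard against empty daily bins
    if len(frame) == 0:
        return frame
    # sort_values replaces the removed DataFrame.sort
    frame = frame.sort_values('date')
    frame['start'] = frame['date'].iloc[0]
    return frame

# pd.Grouper replaces the deprecated pd.TimeGrouper
x = df.groupby(pd.Grouper(freq='D')).apply(first_of_day)

# dividing a timedelta64 column by a Timedelta yields float minutes,
# so the .item() gymnastics are no longer needed
x['diff'] = (x['date'] - x['start']) / pd.Timedelta(minutes=1)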
Thanks for your update; I originally thought you wanted the diff per buyer, not from the first buyer, but that's just a minor tweak.
Update:
To track the buyer name as well (the buyer corresponding to the start date), just include it in the function f:
def f(frame):
    frame.sort('date', inplace=True)
    frame['start'] = frame.date.iloc[0]
    # also record who made the first trade of the day
    frame['start_buyer'] = frame.buyer.iloc[0]
    return frame
Then you can group by it at the end:
In [14]: x.groupby(['start_buyer']).sum()
Out[14]:
diff
start_buyer
John 240
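To cumulate the daily distances to a monthly basis, as the original question asks, one more grouping level is enough. This is a sketch against the frame x from above, assuming a current pandas release (monthly is a placeholder name); to_period buckets the start dates into calendar months, and with this sample everything falls into 2012-01, so John's monthly figure is the same 240:
# sum the per-day distances per first buyer and calendar month
monthly = x.groupby(['start_buyer', x['start'].dt.to_period('M')])['diff'].sum()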

Related

Postgres: how to compare sales current year vs past year with current date

I have the following table "sales"
date revenue
2018-06-01 300
2018-06-02 400
2019-06-01 500
2019-06-02 700
and I want to compare the current year's sales against the past year's sales on the same calendar date, and get the following result:
date | revenue 2019 | revenue 2018
2019-06-01 | 500 | 300
2019-06-02 | 700 | 400
What query should I use?
The problem is that I need to somehow match the 2018 revenue to the corresponding 2019 revenue.
You can do it with a self join:
select
  s1.date,
  s1.revenue revenue2019,
  s2.revenue revenue2018
from sales s1
inner join sales s2
  on s1.date = s2.date + interval '1' year
where date_part('year', s1.date) = 2019
Results:
date       | revenue2019 | revenue2018
-----------+-------------+------------
2019-06-01 |         500 |         300
2019-06-02 |         700 |         400

Create calendar year variable from daily date

I have a Stata elapsed date variable (excerpt below) for each patient patid. I believe I need to generate a new variable if I want to make use of only the year element within that date, that is, not just change the display format.
clear
input long patid float date
1015 18766
1018 13135
1020 13325
1025 14384
1029 14514
1050 13501
1070 14523
1071 14878
1090 14701
1092 14159
end
format %td date
How do I generate a year variable that is the same for all dates within the same year, that is, for all days from 1st January to 31st December?
That calls for just the function year(), which together with similar functions is prominently documented at help datetime.
clear
input long patid float date
1015 18766
1018 13135
1020 13325
1025 14384
1029 14514
1050 13501
1070 14523
1071 14878
1090 14701
1092 14159
end
format %td date
gen year = year(date)
list
+--------------------------+
| patid date year |
|--------------------------|
1. | 1015 19may2011 2011 |
2. | 1018 18dec1995 1995 |
3. | 1020 25jun1996 1996 |
4. | 1025 20may1999 1999 |
5. | 1029 27sep1999 1999 |
|--------------------------|
6. | 1050 18dec1996 1996 |
7. | 1070 06oct1999 1999 |
8. | 1071 25sep2000 2000 |
9. | 1090 01apr2000 2000 |
10. | 1092 07oct1998 1998 |
+--------------------------+

Postgresql group by recurring items

I'm using postgresql to store historical data coming from an RTLS platform.
Position data is not collected continuously.
The historical_movements table is implemented as follows (it is a simplified table, but enough to present the use case):
User Area EnterTime ExitTime
John room1 2018-01-01 10:00:00 2018-01-01 10:00:05
Doe room1 2018-01-01 10:00:00 2018-01-01 10:10:00
John room1 2018-01-01 10:05:00 2018-01-01 10:10:00
Doe room1 2018-01-01 10:20:00 2018-01-01 10:30:00
John room2 2018-01-01 11:00:00 2018-01-01 11:05:00
John room2 2018-01-01 11:08:00 2018-01-01 11:15:00
John room1 2018-01-01 12:00:00 2018-01-01 12:08:00
John room1 2018-01-01 12:10:00 2018-01-01 12:20:00
John room1 2018-01-01 12:25:00 2018-01-01 12:25:00
John room3 2018-01-01 12:30:00 2018-01-01 12:35:00
John room3 2018-01-01 12:40:00 2018-01-01 12:50:00
I'm looking for a way to write a query showing the users' stays in the various rooms, aggregating the data related to contiguous visits to the same room and computing the overall staying time, as follows
User Area EnterTime ExitTime AggregateTime
John room1 2018-01-01 10:00:00 2018-01-01 10:10:00 00:10:00
John room2 2018-01-01 11:00:00 2018-01-01 11:05:00 00:15:00
John room1 2018-01-01 12:00:00 2018-01-01 12:25:00 00:25:00
John room3 2018-01-01 12:30:00 2018-01-01 12:50:00 00:20:00
Doe room1 2018-01-01 10:00:00 2018-01-01 10:30:00 00:30:00
Looking at various threads I'm quite sure I'd have to use lag and partition by functions but it's not clear how.
Any hints?
Best regards.
AggregateTime isn't really an aggregate in your expected result. It seems to be the difference between the max time and the min time of each block, where a block is a set of contiguous rows with the same (user, area). The difference between the two row_number() values below is constant within such a block, so it can serve as a grouping key:
with block as (
  select users, area, entertime, exittime,
         (row_number() over (order by users, entertime) -
          row_number() over (partition by users, area order by entertime)
         ) as grp
  from your_table
  order by 1, 2, 3
)
select users, area, entertime, exittime, (exittime - entertime) as duration
from (
  select users, area, grp,
         min(entertime) as entertime,
         max(exittime) as exittime
  from block
  group by users, area, grp
) t2
order by 5;
I made some changes to 'Resetting Row number according to record data change' to arrive at the solution.

How to get rows between time intervals

I have a delivery_slots table that has a from column (datetime).
Delivery slots are stored as 1 hour to 1 hour and 30 minute intervals, daily.
i.e. 3.00am-4.30am, 6.00am-7.30am, 9.00am-10.30am and so forth
id | from
------+---------------------
1 | 2016-01-01 03:00:00
2 | 2016-01-01 04:30:00
3 | 2016-01-01 06:00:00
4 | 2016-01-01 07:30:00
5 | 2016-01-01 09:00:00
6 | 2016-01-01 10:30:00
7 | 2016-01-01 12:00:00
8 | 2016-01-02 03:00:00
9 | 2016-01-02 04:30:00
10 | 2016-01-02 06:00:00
11 | 2016-01-02 07:30:00
12 | 2016-01-02 09:00:00
13 | 2016-01-02 10:30:00
14 | 2016-01-02 12:00:00
I'm trying to get all delivery slots between the hours of 3.00am and 4.30am. I've got the following so far:
SELECT *
FROM delivery_slots
WHERE EXTRACT(HOUR FROM delivery_slots."from") >= 3
  AND EXTRACT(MINUTE FROM delivery_slots."from") >= 0
  AND EXTRACT(HOUR FROM delivery_slots."from") <= 4
  AND EXTRACT(MINUTE FROM delivery_slots."from") <= 30;
Which kinda works. Kinda, because it is only returning delivery slots that have minutes of 00.
That's because of the last WHERE condition (EXTRACT(MINUTE FROM delivery_slots."from") <= 30).
To give you an idea of what I expect:
id | from
-------+---------------------
1 | 2016-01-01 03:00:00
2 | 2016-01-01 04:30:00
8 | 2016-01-02 03:00:00
9 | 2016-01-02 04:30:00
15 | 2016-01-03 03:00:00
16 | 2016-01-03 04:30:00
etc...
Is there a better way to go about this?
Try this: (not tested)
SELECT *
FROM delivery_slots
WHERE delivery_slots."from"::time >= '03:00:00'
  AND delivery_slots."from"::time <= '04:30:00';
Hope this helps.
Cheers.
The easiest way to do this, in my mind, is to cast the from column to type time and do a where >= and <=, like so:
select * from testing where (date::time >= '3:00'::time and date::time <= '4:30'::time);

Function to calculate aggregate sum count in postgresql

Is there a function that calculates the total count for the complete month, like below? I am not sure if Postgres has one. I am looking for the grand total value.
2012-08=# select date_trunc('day', time), count(distinct column) from table_name group by 1 order by 1;
date_trunc | count
---------------------+-------
2012-08-01 00:00:00 | 22
2012-08-02 00:00:00 | 34
2012-08-03 00:00:00 | 25
2012-08-04 00:00:00 | 30
2012-08-05 00:00:00 | 27
2012-08-06 00:00:00 | 31
2012-08-07 00:00:00 | 23
2012-08-08 00:00:00 | 28
2012-08-09 00:00:00 | 28
2012-08-10 00:00:00 | 28
2012-08-11 00:00:00 | 24
2012-08-12 00:00:00 | 36
2012-08-13 00:00:00 | 28
2012-08-14 00:00:00 | 23
2012-08-15 00:00:00 | 23
2012-08-16 00:00:00 | 30
2012-08-17 00:00:00 | 20
2012-08-18 00:00:00 | 30
2012-08-19 00:00:00 | 20
2012-08-20 00:00:00 | 24
2012-08-21 00:00:00 | 20
2012-08-22 00:00:00 | 17
2012-08-23 00:00:00 | 23
2012-08-24 00:00:00 | 25
2012-08-25 00:00:00 | 35
2012-08-26 00:00:00 | 18
2012-08-27 00:00:00 | 16
2012-08-28 00:00:00 | 11
2012-08-29 00:00:00 | 22
2012-08-30 00:00:00 | 26
2012-08-31 00:00:00 | 17
(31 rows)
--------------------------------
Total | 12345
As best I can guess from your question and comments, you want sub-totals of the distinct counts by month. You can't get those with group by date_trunc('month', time), because that would compute a count(distinct column) that is distinct across the whole month, not a sum of the per-day distinct counts.
For this you need a subquery or CTE:
WITH day_counts(day, day_col_count) AS (
  select date_trunc('day', time), count(distinct column)
  from table_name
  group by 1
)
SELECT 'Day', day, day_col_count
FROM day_counts
UNION ALL
SELECT 'Month', date_trunc('month', day), sum(day_col_count)
FROM day_counts
GROUP BY 2
ORDER BY 2;
My earlier guess before comments was: Group by month?
select date_trunc('month', time), count(distinct column)
from table_name
group by date_trunc('month', time)
order by date_trunc('month', time)
Or are you trying to include running totals or subtotal lines? For running totals you need to use sum as a window function, e.g. sum(count(distinct column)) over (order by date_trunc('day', time)). Subtotals are just a pain, as SQL doesn't really lend itself to them; you need to UNION two queries and then wrap them in an outer ORDER BY:
select
  date_trunc('day', time)::text as "date",
  count(distinct column) as count
from table_name
group by 1
union
select
  'Total',
  count(distinct column)
from table_name
group by 1, date_trunc('month', time)
order by "date" = 'Total', 1