TimeGrouper, pandas - group-by

I use TimeGrouper from pandas.tseries.resample to sum monthly returns into 6-month returns as follows:
return_6m = monthly_return.groupby(TimeGrouper(freq='6M')).aggregate(numpy.sum)
where monthly_return looks like this:
2008-07-01 0.003626
2008-08-01 0.001373
2008-09-01 0.040192
2008-10-01 0.027794
2008-11-01 0.012590
2008-12-01 0.026394
2009-01-01 0.008564
2009-02-01 0.007714
2009-03-01 -0.019727
2009-04-01 0.008888
2009-05-01 0.039801
2009-06-01 0.010042
2009-07-01 0.020971
2009-08-01 0.011926
2009-09-01 0.024998
2009-10-01 0.005213
2009-11-01 0.016804
2009-12-01 0.020724
2010-01-01 0.006322
2010-02-01 0.008971
2010-03-01 0.003911
2010-04-01 0.013928
2010-05-01 0.004640
2010-06-01 0.000744
2010-07-01 0.004697
2010-08-01 0.002553
2010-09-01 0.002770
2010-10-01 0.002834
2010-11-01 0.002157
2010-12-01 0.001034
The resulting 6-month return looks like this:
2008-07-31 0.003626
2009-01-31 0.116907
2009-07-31 0.067688
2010-01-31 0.085986
2010-07-31 0.036890
2011-01-31 0.015283
However, I want the 6-month returns to be grouped in windows starting from 7/2008, so that the bins end as follows:
2008-12-31 ...
2009-06-30 ...
2009-12-31 ...
2010-06-30 ...
2010-12-31 ...
I tried the different input options (e.g. loffset) in TimeGrouper, but none of them worked.
Any suggestions would be really appreciated!

The problem can be solved by adding closed='left':
df.groupby(pd.TimeGrouper('6M', closed='left')).aggregate(numpy.sum)

TimeGrouper, which is suggested in the other answers, is deprecated and will be removed from pandas. It is replaced by Grouper, so a solution to your question using Grouper is:
df.groupby(pd.Grouper(freq='6M', closed='left')).aggregate(numpy.sum)
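As a cross-check that does not depend on any particular pandas version, the bins that closed='left' is meant to produce can be computed by hand. The following is only a minimal sketch using the standard library; the helper names half_year_end and sum_by_half_year are made up for illustration, and it assumes the windows coincide with calendar half-years (true here because the data starts in July). It uses 30 month-start dates from 2008-07-01 with the values 1 to 30.

```python
import calendar
from datetime import date


def half_year_end(d):
    """Label a date with the last day of its calendar half-year (Jun 30 or Dec 31)."""
    end_month = 6 if d.month <= 6 else 12
    last_day = calendar.monthrange(d.year, end_month)[1]
    return date(d.year, end_month, last_day)


def sum_by_half_year(series):
    """Sum (date, value) pairs into bins keyed by the half-year end date."""
    totals = {}
    for d, value in series:
        key = half_year_end(d)
        totals[key] = totals.get(key, 0) + value
    return totals


# 30 month-start dates beginning 2008-07-01, paired with the integers 1..30
months = [date(2008 + (6 + i) // 12, (6 + i) % 12 + 1, 1) for i in range(30)]
series = list(zip(months, range(1, 31)))

totals = sum_by_half_year(series)
for end, total in totals.items():
    print(end, total)
```

The five bins end on 2008-12-31, 2009-06-30, 2009-12-31, 2010-06-30 and 2010-12-31, which is the labelling the question asks for.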

This is a workaround for what seems to be a bug, but give it a try and see if it works for you.
In [121]: ts = pandas.date_range('7/1/2008', periods=30, freq='MS')
In [122]: df = pandas.DataFrame(pandas.Series(range(len(ts)), index=ts))
In [124]: df[0] += 1
In [125]: df
Out[125]:
0
2008-07-01 1
2008-08-01 2
2008-09-01 3
2008-10-01 4
2008-11-01 5
2008-12-01 6
2009-01-01 7
2009-02-01 8
2009-03-01 9
2009-04-01 10
2009-05-01 11
2009-06-01 12
2009-07-01 13
2009-08-01 14
2009-09-01 15
2009-10-01 16
2009-11-01 17
2009-12-01 18
2010-01-01 19
2010-02-01 20
2010-03-01 21
2010-04-01 22
2010-05-01 23
2010-06-01 24
2010-07-01 25
2010-08-01 26
2010-09-01 27
2010-10-01 28
2010-11-01 29
2010-12-01 30
I've used integers to help confirm that the sums are correct. The workaround that seems to work is to add a month to the front of the dataframe to trick the TimeGrouper into doing what you need.
In [127]: df2 = pandas.DataFrame([0], index = [df.index.shift(-1, freq='MS')[0]])
In [129]: df2.append(df).groupby(pandas.TimeGrouper(freq='6M')).aggregate(numpy.sum)[1:]
Out[129]:
0
2008-12-31 21
2009-06-30 57
2009-12-31 93
2010-06-30 129
2010-12-31 165
Note the final [1:] is there to trim off the first group.

Related

How to average data per week?

I hope someone can help me. I am just starting to use R.
First of all, I would like to know whether R can determine the week of the year from the date on which my data was collected. I did this manually, but it takes a long time and increases the chance of making a mistake.
I am also interested in getting the average for each week. For example, I have 2 data points in week 21.
An example of my data:
Week Date Class 1 g/plant Total g/plant 10 berry weight Brix
21 26/05/2022 34.53571429 34.53571429 25.7 11.55
21 28/05/2022 35.39285714 39.25 27.1 10.98
22 31/05/2022 41.17857143 41.17857143 22.8 11.8
22 03/06/2022 57.60714286 57.60714286 22.2 10.91
23 06/06/2022 23.67857143 23.67857143 26.4 12.3
23 09/06/2022 23.60714286 24.14285714 24.7 12.63
24 14/06/2022 18.82142857 19.78571429 26.4 12.8
24 18/06/2022 20.78571429 20.78571429 30 12.05
25 21/06/2022 3.178571429 3.25 22.2 10.3
25 23/06/2022 0 0 0 0
25 25/06/2022 0 0 0 0
26 28/06/2022 0 0 0 0
26 01/07/2022 0 0 0 0
27 05/07/2022 0 0 0 0
27 09/07/2022 0 0 0 0
28 12/07/2022 0 0 0 0
28 14/07/2022 0 0 0 0
28 16/07/2022 0 0 0 0
30 26/07/2022 50.89285714 50.89285714 27.6 9.85
30 29/07/2022 19.39285714 19.39285714 19.1 10.58
31 02/08/2022 68.57142857 68.57142857 25 8.91
31 06/08/2022 58.75 58.75 24.9 8.81
32 09/08/2022 46.57142857 46.57142857 17.7 8.92
32 11/08/2022 24.25 24.25 17.2 9.77
32 13/08/2022 32.14285714 32.14285714 16 20.41
33 16/08/2022 53.14285714 53.14285714 19.7 10.09
33 20/08/2022 57.96428571 59.25 17.8 9.49
34 25/08/2022 28.10714286 28.10714286 18 9.99
35 30/08/2022 81.03571429 81.60714286 19.6 10.89
35 02/09/2022 22.53571429 22.53571429 14.8 10.04
36 06/09/2022 36.53571429 38.96428571 17.9 11.18
36 09/09/2022 24.5 25.71428571 17.3 10.48
37 16/09/2022 57.35714286 60.96428571 21.2 12.21
38 21/09/2022 5.142857143 7.142857143 13.5 11.58
39 30/09/2022 29.9047619 31.76190476 16.4 15.49
40 07/10/2022 22.9047619 24.47619048 16.4 15.12
41 12/10/2022 14.61904762 14.85714286 12.5 14.14
42 19/10/2022 15.57142857 17.04761905 15.6 14.24
43 26/10/2022 20.14285714 22.0952381 17.6 12.32
Thank you in advance!
Alex
I am not sure what to do.
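The two steps asked about (deriving the week number from the date, then averaging per week) can be sketched as follows. This is written in Python rather than R, purely as an illustration of the logic; it assumes the dates are in dd/mm/yyyy format and that ISO week numbers are wanted, which agrees with the Week column in the table. The names iso_week and weekly_mean are hypothetical helpers, and only the first four rows of the Brix column are used.

```python
from collections import defaultdict
from datetime import datetime

# (date as dd/mm/yyyy, Brix) taken from the first four rows of the table
samples = [
    ("26/05/2022", 11.55),
    ("28/05/2022", 10.98),
    ("31/05/2022", 11.8),
    ("03/06/2022", 10.91),
]


def iso_week(text):
    """Return the ISO week number for a dd/mm/yyyy date string."""
    return datetime.strptime(text, "%d/%m/%Y").isocalendar()[1]


def weekly_mean(rows):
    """Average the values of all rows that fall in the same ISO week."""
    groups = defaultdict(list)
    for day, value in rows:
        groups[iso_week(day)].append(value)
    return {week: sum(vals) / len(vals) for week, vals in sorted(groups.items())}


for week, mean in weekly_mean(samples).items():
    print(week, round(mean, 3))
```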

Network data in SQL Server - identifying separate routes

I am looking for help with a database task that would probably be easier to solve in an object-oriented programming language. For now I am still trying to find a T-SQL/SQL Server solution.
I use a source table which contains data about routes. Each record describes one link of a route with routeID, originNodeID and destinationNodeID. The most complicated example of data from this table looks like this:
routeID originNodeID destinationNodeID
WRTV ... ...
WRTX 5 10
WRTX 10 15
WRTX 15 20
WRTX 20 25
WRTX 25 30
WRTX 25 1505
WRTX 25 2005
WRTX 30 35
WRTX 30 1005
WRTX 35 40
WRTX 40 45
WRTX 45 50
WRTX 1005 1010
WRTX 1015 1020
WRTX 1505 1510
WRTX 1510 1515
WRTX 2005 2010
WRTX 2010 2015
WRTX 2020 2025
WRTY .... ....
As you can see, each routeID describes not a linear route but a route with branches. The route from the example may look like this:
             1515   1020
            /      /
           /      /
5 ------ 25 --- 30 -------50
           \
            \
             2025
Now, what I need to do is split this route into separate linear routes:
5-25-30-50 WRTX1
5-25-30-1020 WRTX2
5-25-1515 WRTX3
5-25-2025 WRTX4
For each of the new routes I just need the link sequence like below:
routeID originNodeID destinationNodeID
WRTX1 5 10
WRTX1 10 15
WRTX1 15 20
WRTX1 20 25
WRTX1 25 30
WRTX1 30 35
WRTX1 35 40
WRTX1 40 45
WRTX1 45 50
WRTX2 5 10
WRTX2 10 15
WRTX2 15 20
WRTX2 20 25
WRTX2 25 30
WRTX2 30 1005
WRTX2 1005 1010
WRTX2 1015 1020
WRTX3 5 10
WRTX3 10 15
WRTX3 15 20
WRTX3 20 25
WRTX3 25 1505
WRTX3 1505 1510
WRTX3 1510 1515
WRTX4 5 10
WRTX4 10 15
WRTX4 15 20
WRTX4 20 25
WRTX4 25 2005
WRTX4 2005 2010
WRTX4 2010 2015
WRTX4 2020 2025
Do you have any idea how to solve my problem? I would prefer a solution in SQL Server, but I have only a little experience with loops and cursors, which could possibly be useful here. I once built an ETL process for this, but it only worked when the route split at a single point.
I would be grateful for any help.
Not everything has to be programmed in SQL. No programming language is universal, and some steps are better done in another language. For your task it is more suitable to use Python together with the SQL database.
You can process each row in Python and insert the result into the database. I can give an example script, but you need to provide the correct string "5-25-30-50 WRTX1 5-25-30-1020 WRTX2 5-25-1515 WRTX3 5-25-2025 WRTX4" and a correct example of the data table.
Your table contains node 10, but that node does not appear in the string above. Because of this, I do not understand how the string "5-25-30-50 WRTX1 5-25-30-1020 WRTX2 5-25-1515 WRTX3 5-25-2025 WRTX4" is supposed to be decomposed. For example, does "5-25-30-50 WRTX1" decompose into "5 25 WRTX1", "25 30 WRTX1", "30 50 WRTX1", and so on?
EXAMPLE FOR PYTHON + MSSQL
import pymssql

ServName = 'YourMSSQLServName'
DBName = 'YourDBName'
conn = pymssql.connect(server=ServName, database=DBName)
cursor = conn.cursor()
# Parameterized insert (pymssql uses %s placeholders),
# so the values do not have to be pasted into the SQL text by hand.
querytxt = '''
INSERT INTO [routing]
([routeID], [originNodeID], [destinationNodeID])
VALUES
(%s, %s, %s)'''
limit = 1000
Mask = 'WRTX'
# Each line of rote.txt is expected to look like:
# 5-25-30-50 WRTX1 5-25-30-1020 WRTX2 5-25-1515 WRTX3 5-25-2025 WRTX4
with open('rote.txt', 'r') as F:
    L = [R.strip() for R in F]
for Line in L:
    LineLast = Line
    j = 1
    while len(LineLast) != 0:
        # Stop if there is no further "<nodes> WRTX<n>" pair on this line
        if Mask not in LineLast:
            break
        # Node list before the mask, e.g. "5-25-30-50"
        PingLines = LineLast.partition(Mask)[0].strip()
        # Text after the mask, e.g. "1 5-25-30-1020 WRTX2 ..."
        LineTemp = LineLast.partition(Mask)[2].strip()
        # Route suffix: everything up to the next space
        Num = LineTemp.partition(' ')[0]
        LineLast = LineTemp.partition(' ')[2]
        PingSet = PingLines.split('-')
        routeID = Mask + Num
        i = 0
        # Insert one row per pair of consecutive nodes
        while i < len(PingSet)-1:
            originNodeID = PingSet[i]
            destinationNodeID = PingSet[i+1]
            i = i + 1
            print('Mask = %s\tPing1 = %s\tPing2 = %s' % (routeID, originNodeID, destinationNodeID))
            cursor.execute(querytxt, (routeID, int(originNodeID), int(destinationNodeID)))
            conn.commit()
            if i >= limit: break
        j = j + 1
        if j >= limit: break
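The splitting step itself (turning one branched route into one linear route per branch end) can be sketched as a depth-first walk over the links. This is only an illustration under stated assumptions: the function name split_routes is hypothetical, the links form a tree with exactly one start node and no cycles, and the input is the condensed node list from the diagram (5, 25, 30, 50, 1020, 1515, 2025) rather than the full link table.

```python
from collections import defaultdict


def split_routes(route_id, links):
    """Return (paths, rows): every start-to-end path and its numbered link rows."""
    children = defaultdict(list)
    destinations = set()
    for origin, dest in links:
        children[origin].append(dest)
        destinations.add(dest)

    # Assumption: exactly one start node, i.e. an origin that is never a destination.
    roots = [node for node in children if node not in destinations]
    if len(roots) != 1:
        raise ValueError("expected exactly one start node, got %r" % (roots,))

    paths = []

    def walk(node, path):
        following = sorted(children.get(node, []))
        if not following:          # branch end reached: one complete route
            paths.append(path)
            return
        for nxt in following:      # lower node IDs first
            walk(nxt, path + [nxt])

    walk(roots[0], [roots[0]])

    rows = []
    for number, path in enumerate(paths, start=1):
        for origin, dest in zip(path, path[1:]):
            rows.append((f"{route_id}{number}", origin, dest))
    return paths, rows


# Condensed version of the example route (hypothetical input for illustration)
links = [(5, 25), (25, 30), (30, 50), (30, 1020), (25, 1515), (25, 2025)]
paths, rows = split_routes("WRTX", links)
for path in paths:
    print("-".join(str(node) for node in path))
```

Visiting the lower node ID first at each split reproduces the numbering in the question (the 30 branch before 1515 and 2025). The same traversal could be expressed in SQL Server with a recursive CTE, but that is not shown here.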

KDB/Q sequence generation similar to R's seq(from, to, step)

Is there a way to generate a sequence of numbers with a given step, similar to R's seq(from, to, step) function?
e.g.
> seq(1,20,2)
[1] 1 3 5 7 9 11 13 15 17 19
user2393012's answer is close, but not exactly what the question was looking for. The following works well:
q)seq:{x+z*til ceiling(1+y-x)%z}
q)seq[1;20;2]
1 3 5 7 9 11 13 15 17 19
An alternative (but not better than the simpler arithmetic solutions)
q){-1_(y>=)(z+)\x}[1;20;2]
1 3 5 7 9 11 13 15 17 19
Simply use arithmetic :-)
q){[step;start;length] start+step*til length}[2;0;10]
0 2 4 6 8 10 12 14 16 18
q){[step;start;length] start+step*til length}[3;0;10]
0 3 6 9 12 15 18 21 24 27
Another option (a slight variation of terrylynch's solution):
q) {(z+)\[floor(y-x)%z;x]} [1;20;2]
1 3 5 7 9 11 13 15 17 19
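For readers who do not read q, the element count used by the first definition (ceiling of (1 + to - from) divided by step) can be cross-checked with a short Python sketch. This is only an illustration of the arithmetic, not a q implementation; it assumes integer arguments and a positive step.

```python
import math


def seq(start, stop, step):
    """Return start, start+step, ... without exceeding stop (positive integer step)."""
    # Same count as the q expression: ceiling((1 + y - x) % z)
    count = math.ceil((1 + stop - start) / step)
    return [start + step * i for i in range(count)]


print(seq(1, 20, 2))
```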

Deleting a row in a loop

I have a loop:
for i = 1:size(A,1)
    if A(i,4:6) == [0,0,3.4]
        K = [K; A(i,:)];
    end
end
and I would like to delete the last row in the matrix, but I do not know which row number it will be. How do I delete the last row of the matrix inside the loop? Or should I do it after the loop?
Why do you need a loop? Deleting the last row is a one-time action, not something you do several times.
Check this out; here I delete the last row:
>> a = magic(5);
>> a
a =
17 24 1 8 15
23 5 7 14 16
4 6 13 20 22
10 12 19 21 3
11 18 25 2 9
>> a = a(1:end-1,:);
>> a
a =
17 24 1 8 15
23 5 7 14 16
4 6 13 20 22
10 12 19 21 3
You can refer to the last row with the end keyword:
A = A(1:end-1, :)

I need to retrieve last of these results? (T-SQL)

This query:
SELECT refPatient_id,actDate,refReason_id,refClinic_id,active
FROM PatientClinicHistory
WHERE refClinic_id = 24
GROUP BY refPatient_id,actDate,refReason_id,refClinic_id,active
ORDER BY refPatient_id,actDate
returns this result:
refPatient_id actDate refReason_id refClinic_id active
============= ==================== ============ ============ ======
15704 2009-02-09 12:48:00 19 24 0
15704 2009-02-10 10:25:00 23 24 1
15704 2009-02-10 10:26:00 19 24 0
15704 2009-02-12 10:16:00 23 24 1
15704 2009-02-13 15:41:00 19 24 0
15704 2009-04-14 17:48:00 19 24 0
15704 2009-06-24 16:06:00 19 24 0
15731 2009-05-20 12:19:00 19 24 0
16108 2009-07-20 11:08:00 19 24 0
16139 2009-03-02 13:55:00 19 24 0
16569 2009-07-13 15:57:00 20 24 0
17022 2009-06-02 16:02:00 19 24 0
17022 2009-08-19 15:08:00 19 24 0
17022 2009-09-01 15:47:00 21 24 0
17049 2009-02-02 16:49:00 19 24 0
17049 2009-02-04 15:16:00 19 24 0
17063 2009-07-22 11:35:00 21 24 0
17063 2009-07-28 10:14:00 22 24 1
17502 2008-12-15 17:25:00 19 24 0
I need to get every patient's last passive action row (active = 0) (So I need to obtain the maximum actDate for each patient).
Should I write a new query after I get all these results in order to filter it?
Edited:
Thank you for your responses. Actually, I need to get the last action for each patient.
e.g:
17022 2009-06-02 16:02:00 19 24 0
17022 2009-08-19 15:08:00 19 24 0
17022 2009-09-01 15:47:00 21 24 0
I need to keep only the last row (the maximum actDate for each patient):
17022 2009-09-01 15:47:00 21 24 0
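You could try taking actDate out of the GROUP BY and using MAX(actDate) instead, like this:
SELECT refPatient_id, MAX(actDate) AS lastActDate, refReason_id, refClinic_id, active
FROM PatientClinicHistory
WHERE refClinic_id = 24
AND active = 0
GROUP BY refPatient_id,refReason_id,refClinic_id,active
ORDER BY refPatient_id
Note that refReason_id is still in the GROUP BY, so this returns the latest date per patient and reason rather than a single row per patient.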
You could use a CTE (note that SQL Server does not allow ORDER BY inside the CTE itself unless TOP, OFFSET or FOR XML is also specified):
;WITH PatientClinicHistoryNew AS
(
SELECT refPatient_id,actDate,refReason_id,refClinic_id,active
FROM PatientClinicHistory
WHERE refClinic_id = 24
GROUP BY refPatient_id,actDate,refReason_id,refClinic_id,active
)
SELECT refPatient_id, MAX(actDate) AS lastActDate
FROM PatientClinicHistoryNew
WHERE active = 0
GROUP BY refPatient_id
SELECT refPatient_id,MAX(actDate)
FROM PatientClinicHistory
WHERE refClinic_id = 24
GROUP BY refPatient_id
This will calculate the maximum actDate for each patient. Is that what you want?
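The queries above return the latest date but not the rest of that row. One common way to get the full row is to join the table back to the per-patient maximum. The following is a hedged sketch that runs against an in-memory SQLite database through Python's sqlite3 module, not against SQL Server; it uses a subset of the rows from the question, and the alias names h, m and lastDate are made up for illustration. The inner SELECT and the join are plain SQL that should carry over to T-SQL, but that has not been verified here.

```python
import sqlite3

# Subset of the rows shown in the question
rows = [
    (15704, "2009-04-14 17:48:00", 19, 24, 0),
    (15704, "2009-06-24 16:06:00", 19, 24, 0),
    (17022, "2009-06-02 16:02:00", 19, 24, 0),
    (17022, "2009-08-19 15:08:00", 19, 24, 0),
    (17022, "2009-09-01 15:47:00", 21, 24, 0),
    (17063, "2009-07-22 11:35:00", 21, 24, 0),
    (17063, "2009-07-28 10:14:00", 22, 24, 1),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PatientClinicHistory ("
    "refPatient_id INTEGER, actDate TEXT, refReason_id INTEGER, "
    "refClinic_id INTEGER, active INTEGER)"
)
conn.executemany("INSERT INTO PatientClinicHistory VALUES (?, ?, ?, ?, ?)", rows)

# Join each row to the maximum actDate of its patient and keep only the match
latest = conn.execute(
    """
    SELECT h.refPatient_id, h.actDate, h.refReason_id, h.refClinic_id, h.active
    FROM PatientClinicHistory AS h
    JOIN (
        SELECT refPatient_id, MAX(actDate) AS lastDate
        FROM PatientClinicHistory
        WHERE refClinic_id = 24
        GROUP BY refPatient_id
    ) AS m
      ON m.refPatient_id = h.refPatient_id AND m.lastDate = h.actDate
    WHERE h.refClinic_id = 24
    ORDER BY h.refPatient_id
    """
).fetchall()

for row in latest:
    print(row)
```

To restrict the result to passive actions only, as in the original wording of the question, add active = 0 to both WHERE clauses.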