SQL group-by statement with having-clause in Matlab - matlab

I have two tables in Matlab, Returns and Yearly, that I would like to merge according to the following SQL statement. How can I do this in Matlab? (I have to use Matlab.)
select a.*, b.Equity, b.Date as Yearly_date from Returns a, Yearly b where a.Id = b.Id and a.Date >= b.Date group by a.Id, a.Date having max(b.Date) = b.Date
Here is some sample data:
Returns = table([repmat(1,5,1);repmat(2,6,1)],[(datetime(2013,10,31):calmonths(1):datetime(2014,2,28)).';(datetime(2013,10,31):calmonths(1):datetime(2014,3,31)).'],randn(11,1),'VariableNames',{'Id','Date','Return'})
Returns =
Id Date Return
__ ___________ ________
1 31-Oct-2013 -0.8095
1 30-Nov-2013 -2.9443
1 31-Dec-2013 1.4384
1 31-Jan-2014 0.32519
1 28-Feb-2014 -0.75493
2 31-Oct-2013 1.3703
2 30-Nov-2013 -1.7115
2 31-Dec-2013 -0.10224
2 31-Jan-2014 -0.24145
2 28-Feb-2014 0.31921
2 31-Mar-2014 0.31286
Yearly = table([repmat(1,3,1);repmat(2,2,1)],[(datetime(2011,12,31):calyears(1):datetime(2013,12,31)).';(datetime(2012,12,31):calyears(1):datetime(2013,12,31)).'],[8;10;11;30;28],'VariableNames',{'Id','Date','Equity'})
Yearly =
Id Date Equity
__ ___________ ______
1 31-Dec-2011 8
1 31-Dec-2012 10
1 31-Dec-2013 11
2 31-Dec-2012 30
2 31-Dec-2013 28
I would like the following output:
ans =
Id Date Return Equity Yearly_date
__ ___________ __________ ______ ___________
1 31-Oct-2013 -0.86488 10 31-Dec-2012
1 30-Nov-2013 -0.030051 10 31-Dec-2012
1 31-Dec-2013 -0.16488 11 31-Dec-2013
1 31-Jan-2014 0.62771 11 31-Dec-2013
1 28-Feb-2014 1.0933 11 31-Dec-2013
2 31-Oct-2013 1.1093 30 31-Dec-2012
2 30-Nov-2013 -0.86365 30 31-Dec-2012
2 31-Dec-2013 0.077359 28 31-Dec-2013
2 31-Jan-2014 -1.2141 28 31-Dec-2013
2 28-Feb-2014 -1.1135 28 31-Dec-2013
2 31-Mar-2014 -0.0068493 28 31-Dec-2013

Here goes another bsxfun based solution, abusing its masking capability -
%// Inputs
Returns = table([repmat(1,5,1);repmat(2,6,1)],[(datetime(2013,10,31):...
calmonths(1):datetime(2014,2,28)).';(datetime(2013,10,31):calmonths(1):...
datetime(2014,3,31)).'],randn(11,1),'VariableNames',{'Id','Date','Return'})
Yearly = table([repmat(1,3,1);repmat(2,2,1)],[(datetime(2011,12,31):...
calyears(1):datetime(2013,12,31)).';(datetime(2012,12,31):calyears(1):...
datetime(2013,12,31)).'],[8;10;11;30;28],'VariableNames',{'Id','Date','Equity'})
%// Get mask of matches for each ID in Returns against each ID in Yearly
matches = bsxfun(@ge,datenum(Returns.Date),datenum(Yearly.Date)'); %//'
%// Keep the matches within the respective Ids only
matches(~bsxfun(@ge,Returns.Id,Yearly.Id'))=0; %//'# Or matches(bsxfun(@lt,..)
%// Get the ID (column -ID) of the last match for each Id in Returns
[~,flipped_col_ID] = max(matches(:,end:-1:1),[],2);
col_ID = size(matches,2) - flipped_col_ID + 1;
%// Select the rows from Yearly based on col IDs and create the output table
out = [Returns table(Yearly.Equity(col_ID), Yearly.Date(col_ID))]
Code run -
Returns =
Id Date Return
__ ___________ ________
1 31-Oct-2013 0.045158
1 30-Nov-2013 0.071319
1 31-Dec-2013 0.52357
1 31-Jan-2014 -0.65424
1 28-Feb-2014 1.8452
2 31-Oct-2013 0.037262
2 30-Nov-2013 0.38369
2 31-Dec-2013 1.1972
2 31-Jan-2014 -0.54708
2 28-Feb-2014 -0.15706
2 31-Mar-2014 0.11882
Yearly =
Id Date Equity
__ ___________ ______
1 31-Dec-2011 8
1 31-Dec-2012 10
1 31-Dec-2013 11
2 31-Dec-2012 30
2 31-Dec-2013 28
out =
Id Date Return Var1 Var2
__ ___________ ________ ____ ___________
1 31-Oct-2013 0.045158 10 31-Dec-2012
1 30-Nov-2013 0.071319 10 31-Dec-2012
1 31-Dec-2013 0.52357 11 31-Dec-2013
1 31-Jan-2014 -0.65424 11 31-Dec-2013
1 28-Feb-2014 1.8452 11 31-Dec-2013
2 31-Oct-2013 0.037262 30 31-Dec-2012
2 30-Nov-2013 0.38369 30 31-Dec-2012
2 31-Dec-2013 1.1972 28 31-Dec-2013
2 31-Jan-2014 -0.54708 28 31-Dec-2013
2 28-Feb-2014 -0.15706 28 31-Dec-2013
2 31-Mar-2014 0.11882 28 31-Dec-2013
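The SQL in the question amounts to an "as-of" join: for each Returns row, take the latest Yearly row with the same Id whose date is not after the Returns date. Below is a minimal plain-Python sketch of that logic, not a translation of the bsxfun code; the function name asof_join and the tuple layout are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from datetime import date

def asof_join(returns, yearly):
    """Attach to each (id, date, ret) row the latest yearly row
    with the same id and a date <= the row's date."""
    by_id = {}
    for yid, ydate, equity in yearly:
        by_id.setdefault(yid, []).append((ydate, equity))
    for rows in by_id.values():
        rows.sort()  # sort each id's yearly rows by date
    out = []
    for rid, rdate, ret in returns:
        rows = by_id.get(rid, [])
        dates = [d for d, _ in rows]
        k = bisect_right(dates, rdate)  # number of yearly dates <= rdate
        if k:  # rows without an earlier yearly record are dropped, like the inner join
            ydate, equity = rows[k - 1]
            out.append((rid, rdate, ret, equity, ydate))
    return out

# Yearly data from the question; the return values are made up
yearly = [(1, date(2011, 12, 31), 8), (1, date(2012, 12, 31), 10),
          (1, date(2013, 12, 31), 11), (2, date(2012, 12, 31), 30),
          (2, date(2013, 12, 31), 28)]
returns = [(1, date(2013, 10, 31), 0.1), (1, date(2013, 12, 31), 0.2),
           (2, date(2014, 3, 31), 0.3)]
joined = asof_join(returns, yearly)
```

Unlike the first bsxfun version, this sketch does not rely on the Ids or dates being sorted in the input.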
Generic case solution
For cases where the Ids may be non-numeric and the dates are not already sorted, you can try the following code -
%// Inputs
Returns = table([repmat('Id1',5,1);repmat('Id2',6,1)],[(datetime(2013,10,31):...
calmonths(1):datetime(2014,2,28)).';(datetime(2013,10,31):calmonths(1):...
datetime(2014,3,31)).'],randn(11,1),'VariableNames',{'Id','Date','Return'})
Yearly = table([repmat('Id1',3,1);repmat('Id2',2,1)],[(datetime(2011,12,31):...
calyears(1):datetime(2013,12,31)).';(datetime(2012,12,31):calyears(1):...
datetime(2013,12,31)).'],[8;10;11;30;28],'VariableNames',{'Id','Date','Equity'})
%// -- Convert strings based Ids into numeric ones
alltypes = cellstr([Returns.Id ; Yearly.Id]);
[~,~,IDs] = unique(alltypes,'stable');
lbls_len = size(Returns.Id,1);
Returns_Id = IDs(1:lbls_len);
Yearly_Id = IDs(lbls_len+1:end);
%// Get Returns and Yearly Dates
Returns_Date = datenum(Returns.Date);
Yearly_Date = datenum(Yearly.Date);
%// Sort the dates if not already sorted
y1 = arrayfun(@(n) sort(Returns_Date(Returns_Id==n)),1:max(Returns_Id),'Uni',0);
Returns_Date = vertcat(y1{:});
y2 = arrayfun(@(n) sort(Yearly_Date(Yearly_Id==n)),1:max(Yearly_Id),'Uni',0);
Yearly_Date = vertcat(y2{:});
%// Counts of Ids to be used as boundaries when saving output at each
%// iteration corresponding to each ID
Yearly_Id_counts = [0 ; histc(Yearly_Id,1:max(Yearly_Id))];
Returns_Id_counts = histc(Returns_Id,1:max(Returns_Id));
%// Initializations
stop = 0;
col_ID = zeros(size(Returns_Date,1),1);
for iter = 1:max(Returns_Id)
%// Get mask of matches for each ID in Returns against each ID in Yearly
matches = bsxfun(@ge,Returns_Date(Returns_Id==iter),...
Yearly_Date(Yearly_Id==iter)'); %//'
%// Get the ID (column -ID) of the last match for each Id in Returns
[~,flipped_col_ID] = max(matches(:,end:-1:1),[],2);
%// Get start and stop for indexing into output column IDs array
start = stop + 1;
stop = start + Returns_Id_counts(iter) - 1;
%// Get the columns IDs to be used for indexing into Yearly data for
%// getting the final output
col_ID(start:stop) = Yearly_Id_counts(iter) + ...
Yearly_Id_counts(iter + 1) - flipped_col_ID + 1;
end
%// Select the rows from Yearly based on col IDs and create the output table
out = [Returns table(Yearly.Equity(col_ID), Yearly.Date(col_ID))]
Code run -
Returns =
Id Date Return
___ ___________ ________
Id1 31-Oct-2013 0.53767
Id1 30-Nov-2013 1.8339
Id1 31-Dec-2013 -2.2588
Id1 31-Jan-2014 0.86217
Id1 28-Feb-2014 0.31877
Id2 31-Oct-2013 -1.3077
Id2 30-Nov-2013 -0.43359
Id2 31-Dec-2013 0.34262
Id2 31-Jan-2014 3.5784
Id2 28-Feb-2014 2.7694
Id2 31-Mar-2014 -1.3499
Yearly =
Id Date Equity
___ ___________ ______
Id1 31-Dec-2011 8
Id1 31-Dec-2012 10
Id1 31-Dec-2013 11
Id2 31-Dec-2012 30
Id2 31-Dec-2013 28
out =
Id Date Return Var1 Var2
___ ___________ ________ ____ ___________
Id1 31-Oct-2013 0.53767 10 31-Dec-2012
Id1 30-Nov-2013 1.8339 10 31-Dec-2012
Id1 31-Dec-2013 -2.2588 11 31-Dec-2013
Id1 31-Jan-2014 0.86217 11 31-Dec-2013
Id1 28-Feb-2014 0.31877 11 31-Dec-2013
Id2 31-Oct-2013 -1.3077 30 31-Dec-2012
Id2 30-Nov-2013 -0.43359 30 31-Dec-2012
Id2 31-Dec-2013 0.34262 28 31-Dec-2013
Id2 31-Jan-2014 3.5784 28 31-Dec-2013
Id2 28-Feb-2014 2.7694 28 31-Dec-2013

Related

How can I evenly divide records into N groups based on the values?

For a table as follows, how can I divide these records evenly into 3 groups based on the value of “factor_value”?
sym date factor_value
------ ---------- ------------
100000 2022.04.27 1
100001 2022.04.27 2
100002 2022.04.27 3
100003 2022.04.27 4
100004 2022.04.27 5
100005 2022.04.27 6
100006 2022.04.27 7
100007 2022.04.27 8
100008 2022.04.27 9
100009 2022.04.27 10
100010 2022.04.28
100000 2022.04.28
100001 2022.04.28
100002 2022.04.28 3
100003 2022.04.28 4
100004 2022.04.28 5
100005 2022.04.28 6
100006 2022.04.28 7
100007 2022.04.28 8
100008 2022.04.28 9
This can be implemented with the DolphinDB functions cutPoints and asof.
sym=take(string(100000..100010),20)
date=sort(take(2022.04.27..2022.04.28,20))
factor_value= 1..10 join take(int(),3) join 3..9
tb= table( sym, date, factor_value)
select *,asof(cutPoints(int(factor_value*100000),3),factor_value*100000)+1 as factor_quantile from tb context by date csort factor_value having size(distinct(factor_value*100000))>3
First, use context by with csort to sort each date group by the column factor_value. Then use cutPoints to compute the boundaries that split the records evenly into 3 groups. asof returns the group number for each element in the group.
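The bucketing step can be sketched in plain Python. This is an illustration under a stated assumption, not DolphinDB's documented algorithm: the cut points are taken at sorted positions ceil(i*n/bins), a rule chosen only because it reproduces the grouping shown for the values 1..10. The names cut_points and bucket are hypothetical, and null handling is left out.

```python
from bisect import bisect_right

def cut_points(values, bins):
    """Lower boundary of each bin, taken from the sorted values.
    Assumption: boundary i sits at sorted position ceil(i * n / bins)."""
    s = sorted(values)
    n = len(s)
    return [s[(i * n + bins - 1) // bins] for i in range(bins)]

def bucket(values, bins):
    """1-based group number: index of the last boundary <= value, plus one."""
    points = cut_points(values, bins)
    return [bisect_right(points, v) for v in values]

groups = bucket(list(range(1, 11)), 3)
```

Here bisect_right plays the role of asof(...)+1: it counts how many boundaries are less than or equal to the value.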
output:
sym date factor_value factor_quantile
------ ---------- ------------ ---------------
100000 2022.04.27 1 1
100001 2022.04.27 2 1
100002 2022.04.27 3 1
100003 2022.04.27 4 1
100004 2022.04.27 5 2
100005 2022.04.27 6 2
100006 2022.04.27 7 2
100007 2022.04.27 8 3
100008 2022.04.27 9 3
100009 2022.04.27 10 3
100010 2022.04.28 1
100000 2022.04.28 1
100001 2022.04.28 1
100002 2022.04.28 3 1
100003 2022.04.28 4 2
100004 2022.04.28 5 2
100005 2022.04.28 6 2
100006 2022.04.28 7 3
100007 2022.04.28 8 3
100008 2022.04.28 9 3

Why PySpark partitionBy isn't working properly?

I have a table as:
COL1 COL2 COL3
---- ---- ----------
COMP 0005 2008-08-04
COMP 0009 2002-01-01
COMP 01.0 2002-01-01
COMP 0005 2008-01-01
COMP 0005 2001-10-20
CTEC 0009 2001-10-20
COMP 0005 2009-10-01
COMP 01.0 2003-07-01
COMP 02.0 2004-01-01
CTEC 0009 2021-09-24
First I want to partition the table on COL1, then partition it again on COL2, and then sort COL3 in descending order within each partition. Finally, I want to add a row number.
I write:
windowSpec = Window.partitionBy(col("COL1")).partitionBy(col("COl2")).orderBy(desc("COL3"))
TBL = TBL.withColumn(f"RANK", F.row_number().over(windowSpec))
My expected output is this:
COL1 COL2 COL3       RANK
---- ---- ---------- ----
COMP 0005 2009-10-01 1
COMP 0005 2008-08-04 2
COMP 0005 2008-01-01 3
COMP 0005 2001-10-20 4
COMP 0009 2002-01-01 1
COMP 01.0 2003-07-01 1
COMP 01.0 2002-01-01 2
COMP 02.0 2004-01-01 1
CTEC 0009 2021-09-24 1
CTEC 0009 2001-10-20 2
But the output I'm getting is like this:
COL1 COL2 COL3       RANK
---- ---- ---------- ----
COMP 0005 2009-10-01 1
COMP 0005 2008-08-04 2
COMP 0005 2008-01-01 3
COMP 0005 2001-10-20 4
COMP 0009 2002-01-01 2
COMP 01.0 2003-07-01 1
COMP 01.0 2002-01-01 2
COMP 02.0 2004-01-01 1
CTEC 0009 2021-09-24 1
CTEC 0009 2001-10-20 3
Can anyone help me figure out where I am making the mistake?
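The ranks in the actual output are what you would get from a window partitioned by COL2 alone, which suggests that the second partitionBy call replaced the first instead of nesting inside it. Window.partitionBy accepts several columns, so the usual form would be Window.partitionBy("COL1", "COL2").orderBy(desc("COL3")). The plain-Python sketch below (no Spark required; the helper row_number is hypothetical) reproduces both sets of ranks from the sample rows to show the difference.

```python
from collections import defaultdict

rows = [
    ("COMP", "0005", "2008-08-04"), ("COMP", "0009", "2002-01-01"),
    ("COMP", "01.0", "2002-01-01"), ("COMP", "0005", "2008-01-01"),
    ("COMP", "0005", "2001-10-20"), ("CTEC", "0009", "2001-10-20"),
    ("COMP", "0005", "2009-10-01"), ("COMP", "01.0", "2003-07-01"),
    ("COMP", "02.0", "2004-01-01"), ("CTEC", "0009", "2021-09-24"),
]

def row_number(rows, key):
    """Rank rows within each partition by COL3 descending, starting at 1."""
    groups = defaultdict(list)
    for r in rows:
        groups[key(r)].append(r)
    ranked = {}
    for members in groups.values():
        ordered = sorted(members, key=lambda r: r[2], reverse=True)
        for rank, r in enumerate(ordered, start=1):
            ranked[r] = rank
    return ranked

# Partition on (COL1, COL2): matches the expected output
both = row_number(rows, key=lambda r: (r[0], r[1]))
# Partition on COL2 only: matches the output actually observed
only_col2 = row_number(rows, key=lambda r: r[1])
```

The two results differ only for COL2 = 0009, the one value that appears under both COMP and CTEC.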

How to consistently sum lists of values contained in a table?

I have the following two tables:
t1:([]sym:`AAPL`GOOG; histo_dates1:(2000.01.01+til 10;2000.01.01+til 10);histo_values1:(til 10;5+til 10));
t2:([]sym:`AAPL`GOOG; histo_dates2:(2000.01.05+til 5;2000.01.06+til 4);histo_values2:(til 5; 2+til 4));
What I want is to sum the histo_values of each symbol across the histo_dates, such that the resulting table would look like this:
t:([]sym:`AAPL`GOOG; histo_dates:(2000.01.01+til 10;2000.01.01+til 10);histo_values:(0 1 2 3 4 6 8 10 12 9;5 6 7 8 9 12 14 16 18 14))
So the resulting dates histo_dates should be the union of histo_dates1 and histo_dates2, and histo_values should be the sum of histo_values1 and histo_values2 across dates.
EDIT:
I insist on the union of the dates, as I want the resulting histo_dates to be the union of both histo_dates1 and histo_dates2.
There are a few ways. One would be to ungroup to remove nesting, join the tables, aggregate on sym/date and then regroup on sym:
q)0!select histo_dates:histo_dates1, histo_values:histo_values1 by sym from select sum histo_values1 by sym, histo_dates1 from ungroup[t1],cols[t1]xcol ungroup[t2]
sym histo_dates histo_values
-------------------------------------------------------------------------------------------------------------------------------------------
AAPL 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 0 1 2 3 4 6 8 10 12 9
GOOG 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 5 6 7 8 9 12 14 16 18 14
A possibly faster way would be to make each row a dictionary and then key the tables on sym and add them:
q)select sym:s, histo_dates:key each v, histo_values:value each v from (1!select s, d!'v from `s`d`v xcol t1)+(1!select s, d!'v from `s`d`v xcol t2)
sym histo_dates histo_values
-------------------------------------------------------------------------------------------------------------------------------------------
AAPL 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 0 1 2 3 4 6 8 10 12 9
GOOG 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 5 6 7 8 9 12 14 16 18 14
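For readers less familiar with q, the dictionary idea can be sketched in plain Python: map each date to its value, add the values of dates that appear in both inputs, and read the result back in date order. The function name sum_by_date is hypothetical; the data is the AAPL row from the question.

```python
from datetime import date, timedelta

def sum_by_date(dates1, values1, dates2, values2):
    """Sum values over the union of both date lists."""
    totals = {}
    for d, v in list(zip(dates1, values1)) + list(zip(dates2, values2)):
        totals[d] = totals.get(d, 0) + v
    dates = sorted(totals)
    return dates, [totals[d] for d in dates]

# AAPL: histo_dates1 = 2000.01.01 + til 10, histo_dates2 = 2000.01.05 + til 5
d1 = [date(2000, 1, 1) + timedelta(days=i) for i in range(10)]
d2 = [date(2000, 1, 5) + timedelta(days=i) for i in range(5)]
dates, values = sum_by_date(d1, list(range(10)), d2, list(range(5)))
```

Because the totals are keyed by date, the result covers the union of the dates regardless of the order in which either input lists them.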
Another option would be to use a plus join pj:
q)0!`sym xgroup 0!pj[ungroup `sym`histo_dates`histo_values xcol t1;2!ungroup `sym`histo_dates`histo_values xcol t2]
sym histo_dates histo_values
-------------------------------------------------------------------------------------------------------------------------------------------
AAPL 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 0 1 2 3 4 6 8 10 12 9
GOOG 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 5 6 7 8 9 12 14 16 18 14
See here for more on plus joins: https://code.kx.com/v2/ref/pj/
EDIT:
To explicitly make sure the result has the union of the dates, you could use a union join:
q)0!`sym xgroup select sym,histo_dates,histo_values:hv1+hv2 from 0^uj[2!ungroup `sym`histo_dates`hv1 xcol t1;2!ungroup `sym`histo_dates`hv2 xcol t2]
sym histo_dates histo_values
-------------------------------------------------------------------------------------------------------------------------------------------
AAPL 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 0 1 2 3 4 6 8 10 12 9
GOOG 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 5 6 7 8 9 12 14 16 18 14
Another way:
// rename the columns to be common names, ungroup the tables, and place the key on `sym and `histo_dates
q){2!ungroup `sym`histo_dates`histo_values xcol x} each (t1;t2)
// add them together (or use pj in place of +), group on `sym
`sym xgroup (+) . {2!ungroup `sym`histo_dates`histo_values xcol x} each (t1;t2)
// and to test this matches t, remove the key from the resulting table
q)t~0!`sym xgroup (+) . {2!ungroup `sym`histo_dates`histo_values xcol x} each (t1;t2)
1b
Another possible way, using functional amend:
//Column join the histo_dates* columns and get the distinct dates - drop idx
//Using a functional apply use the idx to determine which values to plus
//Join the two tables using sym as the key - Find the idx of common dates
(enlist `idx) _select sym,histo_dates:distinct each (histo_dates1,'histo_dates2),
histovalues:{@[x;z;+;y]}'[histo_values1;histo_values2;idx],idx from
update idx:(where each histo_dates1 in' histo_dates2) from ((1!t1) uj 1!t2)
One possible problem with this approach is that computing idx depends on the date columns being sorted, which is usually the case.

Replace DataFrame rows with most recent data based on key

I have a dataframe that looks like this:
user_id val date
1 10 2015-02-01
1 11 2015-01-01
2 12 2015-03-01
2 13 2015-02-01
3 14 2015-03-01
3 15 2015-04-01
I need to run a function that calculates (let's say) the sum of vals chronologically by the dates. If a user has a more recent date, use that date, but if not, keep the older date.
For example. If I run the function with the date 2015-03-15, then the table will be:
user_id val date
1 10 2015-02-01
2 12 2015-03-01
3 14 2015-03-01
Giving me a sum of 36.
If I run the function with the date 2015-04-15, then the table will be:
user_id val date
1 10 2015-02-01
2 12 2015-03-01
3 15 2015-04-01
(User 3's row was replaced with a more recent date).
I know this is fairly esoteric, but thought I could bounce this off all of you as I have been trying to think of a simple way of doing this..
Try this:
In [36]: df.loc[df.date <= '2015-03-15']
Out[36]:
user_id val date
0 1 10 2015-02-01
1 1 11 2015-01-01
2 2 12 2015-03-01
3 2 13 2015-02-01
4 3 14 2015-03-01
In [39]: df.loc[df.date <= '2015-03-15'].sort_values('date').groupby('user_id').agg({'date':'last', 'val':'last'}).reset_index()
Out[39]:
user_id date val
0 1 2015-02-01 10
1 2 2015-03-01 12
2 3 2015-03-01 14
or:
In [40]: df.loc[df.date <= '2015-03-15'].sort_values('date').groupby('user_id').last().reset_index()
Out[40]:
user_id val date
0 1 10 2015-02-01
1 2 12 2015-03-01
2 3 14 2015-03-01
In [41]: df.loc[df.date <= '2015-04-15'].sort_values('date').groupby('user_id').last().reset_index()
Out[41]:
user_id val date
0 1 10 2015-02-01
1 2 12 2015-03-01
2 3 15 2015-04-01
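The same filter-then-take-last logic can be sketched without pandas, which may help when checking the results by hand. This is only an illustration; the function name latest_per_user is hypothetical, and the dates are compared as ISO strings, which sort chronologically.

```python
rows = [
    (1, 10, "2015-02-01"), (1, 11, "2015-01-01"),
    (2, 12, "2015-03-01"), (2, 13, "2015-02-01"),
    (3, 14, "2015-03-01"), (3, 15, "2015-04-01"),
]

def latest_per_user(rows, cutoff):
    """Keep, for each user, the row with the most recent date <= cutoff."""
    latest = {}
    for user_id, val, day in rows:
        if day <= cutoff and (user_id not in latest or day > latest[user_id][1]):
            latest[user_id] = (val, day)
    return latest

march = latest_per_user(rows, "2015-03-15")
total_march = sum(val for val, _ in march.values())
april = latest_per_user(rows, "2015-04-15")
total_april = sum(val for val, _ in april.values())
```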

How to do a simultaneous ascending and descending sort in KDB/Q

In SQL, one can do
SELECT * FROM tbl ORDER BY col1, col2 DESC
In KDB, one can do
`col1 xasc select from tbl
or
`col2 xdesc select from tbl
But how does one sort by col1 ascending then by col2 descending in KDB/Q?
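Use two sorts. Apply the secondary sort (descending) first and the primary sort (ascending) last; because the sorts are stable, rows that tie on the primary column keep the order given by the secondary sort.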
Create example data:
q)show tbl:([]a:10?10;b:10?10;c:10?10)
a b c
-----
8 4 8
1 9 1
7 2 9
2 7 5
4 0 4
5 1 6
4 9 6
2 2 1
7 1 8
8 8 5
Do sorting:
q)`a xasc `b xdesc tbl
a b c
-----
1 9 1
2 7 5
2 2 1
4 9 6
4 0 4
5 1 6
7 2 9
7 1 8
8 8 5
8 4 8
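The two-pass idea is not specific to q; it works with any stable sort. Here is a small Python sketch using the same example rows (Python's sorted is stable), sorting by b descending first and then by a ascending.

```python
# Rows (a, b, c) from the example table above
tbl = [(8, 4, 8), (1, 9, 1), (7, 2, 9), (2, 7, 5), (4, 0, 4),
       (5, 1, 6), (4, 9, 6), (2, 2, 1), (7, 1, 8), (8, 8, 5)]

# Secondary key first: b descending
by_b_desc = sorted(tbl, key=lambda r: r[1], reverse=True)
# Primary key last: a ascending; ties on a keep the b-descending order
result = sorted(by_b_desc, key=lambda r: r[0])
```

The result lists the rows in the same order as the q output above.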