I have two tables in MATLAB, `Returns` and `Yearly`, that I would like to merge according to the following SQL statement. How do I merge them in MATLAB? (I have to use MATLAB.)
select a.*, b.Equity, b.Date as Yearly_date from Returns a, Yearly b where a.Id = b.Id and a.Date >= b.Date group by a.Id, a.Date having max(b.Date) = b.Date
Here is some sample data:
Returns = table([repmat(1,5,1);repmat(2,6,1)],[(datetime(2013,10,31):calmonths(1):datetime(2014,2,28)).';(datetime(2013,10,31):calmonths(1):datetime(2014,3,31)).'],randn(11,1),'VariableNames',{'Id','Date','Return'})
Returns =
Id Date Return
__ ___________ ________
1 31-Oct-2013 -0.8095
1 30-Nov-2013 -2.9443
1 31-Dec-2013 1.4384
1 31-Jan-2014 0.32519
1 28-Feb-2014 -0.75493
2 31-Oct-2013 1.3703
2 30-Nov-2013 -1.7115
2 31-Dec-2013 -0.10224
2 31-Jan-2014 -0.24145
2 28-Feb-2014 0.31921
2 31-Mar-2014 0.31286
Yearly = table([repmat(1,3,1);repmat(2,2,1)],[(datetime(2011,12,31):calyears(1):datetime(2013,12,31)).';(datetime(2012,12,31):calyears(1):datetime(2013,12,31)).'],[8;10;11;30;28],'VariableNames',{'Id','Date','Equity'})
Yearly =
Id Date Equity
__ ___________ ______
1 31-Dec-2011 8
1 31-Dec-2012 10
1 31-Dec-2013 11
2 31-Dec-2012 30
2 31-Dec-2013 28
I would like the following output:
ans =
Id Date Return Equity Yearly_date
__ ___________ __________ ______ ___________
1 31-Oct-2013 -0.86488 10 31-Dec-2012
1 30-Nov-2013 -0.030051 10 31-Dec-2012
1 31-Dec-2013 -0.16488 11 31-Dec-2013
1 31-Jan-2014 0.62771 11 31-Dec-2013
1 28-Feb-2014 1.0933 11 31-Dec-2013
2 31-Oct-2013 1.1093 30 31-Dec-2012
2 30-Nov-2013 -0.86365 30 31-Dec-2012
2 31-Dec-2013 0.077359 28 31-Dec-2013
2 31-Jan-2014 -1.2141 28 31-Dec-2013
2 28-Feb-2014 -1.1135 28 31-Dec-2013
2 31-Mar-2014 -0.0068493 28 31-Dec-2013
Here goes another bsxfun based solution, abusing its masking capability -
%// Inputs
Returns = table([repmat(1,5,1);repmat(2,6,1)],[(datetime(2013,10,31):...
calmonths(1):datetime(2014,2,28)).';(datetime(2013,10,31):calmonths(1):...
datetime(2014,3,31)).'],randn(11,1),'VariableNames',{'Id','Date','Return'})
Yearly = table([repmat(1,3,1);repmat(2,2,1)],[(datetime(2011,12,31):...
calyears(1):datetime(2013,12,31)).';(datetime(2012,12,31):calyears(1):...
datetime(2013,12,31)).'],[8;10;11;30;28],'VariableNames',{'Id','Date','Equity'})
%// Get mask of matches for each ID in Returns against each ID in Yearly
matches = bsxfun(@ge,datenum(Returns.Date),datenum(Yearly.Date)'); %//'
%// Keep the matches within the respective Ids only
matches(~bsxfun(@ge,Returns.Id,Yearly.Id'))=0; %//'# Or matches(bsxfun(@lt,..)
%// Get the ID (column -ID) of the last match for each Id in Returns
[~,flipped_col_ID] = max(matches(:,end:-1:1),[],2);
col_ID = size(matches,2) - flipped_col_ID + 1;
%// Select the rows from Yearly based on col IDs and create the output table
out = [Returns table(Yearly.Equity(col_ID), Yearly.Date(col_ID))]
Code run -
Returns =
Id Date Return
__ ___________ ________
1 31-Oct-2013 0.045158
1 30-Nov-2013 0.071319
1 31-Dec-2013 0.52357
1 31-Jan-2014 -0.65424
1 28-Feb-2014 1.8452
2 31-Oct-2013 0.037262
2 30-Nov-2013 0.38369
2 31-Dec-2013 1.1972
2 31-Jan-2014 -0.54708
2 28-Feb-2014 -0.15706
2 31-Mar-2014 0.11882
Yearly =
Id Date Equity
__ ___________ ______
1 31-Dec-2011 8
1 31-Dec-2012 10
1 31-Dec-2013 11
2 31-Dec-2012 30
2 31-Dec-2013 28
out =
Id Date Return Var1 Var2
__ ___________ ________ ____ ___________
1 31-Oct-2013 0.045158 10 31-Dec-2012
1 30-Nov-2013 0.071319 10 31-Dec-2012
1 31-Dec-2013 0.52357 11 31-Dec-2013
1 31-Jan-2014 -0.65424 11 31-Dec-2013
1 28-Feb-2014 1.8452 11 31-Dec-2013
2 31-Oct-2013 0.037262 30 31-Dec-2012
2 30-Nov-2013 0.38369 30 31-Dec-2012
2 31-Dec-2013 1.1972 28 31-Dec-2013
2 31-Jan-2014 -0.54708 28 31-Dec-2013
2 28-Feb-2014 -0.15706 28 31-Dec-2013
2 31-Mar-2014 0.11882 28 31-Dec-2013
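For readers who want to check the matching rule independently of bsxfun, here is a minimal sketch in plain Python (not MATLAB) of what the SQL asks for: for each Returns row, take the latest Yearly row with the same Id whose date is on or before the Returns date. The helper name latest_yearly is made up for this illustration, and the Yearly rows are copied from the sample data.

```python
from bisect import bisect_right
from datetime import date

# Yearly rows as (Id, Date, Equity), copied from the sample data.
yearly = [
    (1, date(2011, 12, 31), 8),
    (1, date(2012, 12, 31), 10),
    (1, date(2013, 12, 31), 11),
    (2, date(2012, 12, 31), 30),
    (2, date(2013, 12, 31), 28),
]


def latest_yearly(yearly_rows, ret_id, ret_date):
    """Hypothetical helper: return (Equity, Yearly_date) of the latest Yearly
    row for ret_id dated on or before ret_date, or None if there is none."""
    rows = sorted((d, e) for i, d, e in yearly_rows if i == ret_id)
    dates = [d for d, _ in rows]
    pos = bisect_right(dates, ret_date)  # number of Yearly dates <= ret_date
    if pos == 0:
        return None
    d, e = rows[pos - 1]
    return e, d


print(latest_yearly(yearly, 1, date(2013, 10, 31)))
print(latest_yearly(yearly, 1, date(2013, 12, 31)))
```

Unlike the mask-based code, this sorts the Yearly dates per Id itself, so it does not rely on the input order.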
Generic case solution
For cases where the Ids could be non-numeric and the dates aren't already sorted, you may try out the following code -
%// Inputs
Returns = table([repmat('Id1',5,1);repmat('Id2',6,1)],[(datetime(2013,10,31):...
calmonths(1):datetime(2014,2,28)).';(datetime(2013,10,31):calmonths(1):...
datetime(2014,3,31)).'],randn(11,1),'VariableNames',{'Id','Date','Return'})
Yearly = table([repmat('Id1',3,1);repmat('Id2',2,1)],[(datetime(2011,12,31):...
calyears(1):datetime(2013,12,31)).';(datetime(2012,12,31):calyears(1):...
datetime(2013,12,31)).'],[8;10;11;30;28],'VariableNames',{'Id','Date','Equity'})
%// -- Convert string-based Ids into numeric ones
alltypes = cellstr([Returns.Id ; Yearly.Id]);
[~,~,IDs] = unique(alltypes,'stable');
lbls_len = size(Returns.Id,1);
Returns_Id = IDs(1:lbls_len);
Yearly_Id = IDs(lbls_len+1:end);
%// Get Returns and Yearly Dates
Returns_Date = datenum(Returns.Date);
Yearly_Date = datenum(Yearly.Date);
%// Sort the dates if not already sorted
y1 = arrayfun(@(n) sort(Returns_Date(Returns_Id==n)),1:max(Returns_Id),'Uni',0);
Returns_Date = vertcat(y1{:});
y2 = arrayfun(@(n) sort(Yearly_Date(Yearly_Id==n)),1:max(Yearly_Id),'Uni',0);
Yearly_Date = vertcat(y2{:});
%// Counts of Ids to be used as boundaries when saving output at each
%// iteration corresponding to each ID
Yearly_Id_counts = [0 ; histc(Yearly_Id,1:max(Yearly_Id))];
Returns_Id_counts = histc(Returns_Id,1:max(Returns_Id));
%// Initializations
stop = 0;
col_ID = zeros(size(Returns_Date,1),1);
for iter = 1:max(Returns_Id)
%// Get mask of matches for each ID in Returns against each ID in Yearly
matches = bsxfun(#ge,Returns_Date(Returns_Id==iter),...
Yearly_Date(Yearly_Id==iter)'); %//'
%// Get the ID (column -ID) of the last match for each Id in Returns
[~,flipped_col_ID] = max(matches(:,end:-1:1),[],2);
%// Get start and stop for indexing into output column IDs array
start = stop + 1;
stop = start + Returns_Id_counts(iter) - 1;
%// Get the columns IDs to be used for indexing into Yearly data for
%// getting the final output
col_ID(start:stop) = Yearly_Id_counts(iter) + ...
Yearly_Id_counts(iter + 1) - flipped_col_ID + 1;
end
%// Select the rows from Yearly based on col IDs and create the output table
out = [Returns table(Yearly.Equity(col_ID), Yearly.Date(col_ID))]
Code run -
Returns =
Id Date Return
___ ___________ ________
Id1 31-Oct-2013 0.53767
Id1 30-Nov-2013 1.8339
Id1 31-Dec-2013 -2.2588
Id1 31-Jan-2014 0.86217
Id1 28-Feb-2014 0.31877
Id2 31-Oct-2013 -1.3077
Id2 30-Nov-2013 -0.43359
Id2 31-Dec-2013 0.34262
Id2 31-Jan-2014 3.5784
Id2 28-Feb-2014 2.7694
Id2 31-Mar-2014 -1.3499
Yearly =
Id Date Equity
___ ___________ ______
Id1 31-Dec-2011 8
Id1 31-Dec-2012 10
Id1 31-Dec-2013 11
Id2 31-Dec-2012 30
Id2 31-Dec-2013 28
out =
Id Date Return Var1 Var2
___ ___________ ________ ____ ___________
Id1 31-Oct-2013 0.53767 10 31-Dec-2012
Id1 30-Nov-2013 1.8339 10 31-Dec-2012
Id1 31-Dec-2013 -2.2588 11 31-Dec-2013
Id1 31-Jan-2014 0.86217 11 31-Dec-2013
Id1 28-Feb-2014 0.31877 11 31-Dec-2013
Id2 31-Oct-2013 -1.3077 30 31-Dec-2012
Id2 30-Nov-2013 -0.43359 30 31-Dec-2012
Id2 31-Dec-2013 0.34262 28 31-Dec-2013
Id2 31-Jan-2014 3.5784 28 31-Dec-2013
Id2 28-Feb-2014 2.7694 28 31-Dec-2013
Id2 31-Mar-2014 -1.3499 28 31-Dec-2013
Related
For a table as follows, how can I divide these records evenly into 3 groups based on the value of “factor_value”?
sym date factor_value
------ ---------- ------------
100000 2022.04.27 1
100001 2022.04.27 2
100002 2022.04.27 3
100003 2022.04.27 4
100004 2022.04.27 5
100005 2022.04.27 6
100006 2022.04.27 7
100007 2022.04.27 8
100008 2022.04.27 9
100009 2022.04.27 10
100010 2022.04.28
100000 2022.04.28
100001 2022.04.28
100002 2022.04.28 3
100003 2022.04.28 4
100004 2022.04.28 5
100005 2022.04.28 6
100006 2022.04.28 7
100007 2022.04.28 8
100008 2022.04.28 9
This can be implemented with the DolphinDB functions cutPoints and asof.
sym=take(string(100000..100010),20)
date=sort(take(2022.04.27..2022.04.28,20))
factor_value= 1..10 join take(int(),3) join 3..9
tb= table( sym, date, factor_value)
select *,asof(cutPoints(int(factor_value*100000),3),factor_value*100000)+1 as factor_quantile from tb context by date csort factor_value having size(distinct(factor_value*100000))>3
First, use context by with csort to sort the records by factor_value within each date. Then cutPoints computes the boundaries that split the records evenly into 3 groups, and asof returns the group number of each element.
output:
sym date factor_value factor_quantile
------ ---------- ------------ ---------------
100000 2022.04.27 1 1
100001 2022.04.27 2 1
100002 2022.04.27 3 1
100003 2022.04.27 4 1
100004 2022.04.27 5 2
100005 2022.04.27 6 2
100006 2022.04.27 7 2
100007 2022.04.27 8 3
100008 2022.04.27 9 3
100009 2022.04.27 10 3
100010 2022.04.28 1
100000 2022.04.28 1
100001 2022.04.28 1
100002 2022.04.28 3 1
100003 2022.04.28 4 2
100004 2022.04.28 5 2
100005 2022.04.28 6 2
100006 2022.04.28 7 3
100007 2022.04.28 8 3
100008 2022.04.28 9 3
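To make the cut-point idea concrete, here is a rough sketch in plain Python. It is not DolphinDB's implementation: the boundary rule (the value at sorted position ceil(n*k/bins)) is an assumption chosen because it reproduces the groups shown for 2022.04.27, and the function name quantile_groups is made up. It assumes a non-empty list and does not handle null values.

```python
from bisect import bisect_right


def quantile_groups(values, bins=3):
    """Hypothetical helper: give each value a 1-based group number so that
    the sorted values fall into `bins` groups of nearly equal size."""
    ordered = sorted(values)
    n = len(ordered)
    # Lower boundary of group k: the value at sorted position ceil(n*k/bins).
    cuts = [ordered[(n * k + bins - 1) // bins] for k in range(bins)]
    # A value's group is the number of boundaries that are <= the value.
    return [bisect_right(cuts, v) for v in values]


factor_values = list(range(1, 11))  # 1..10, the 2022.04.27 rows
print(quantile_groups(factor_values))
```

The lookup with bisect_right plays the role that asof plays in the query: it finds the last boundary that does not exceed the value.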
I have a table as:
COL1  COL2  COL3
COMP  0005  2008-08-04
COMP  0009  2002-01-01
COMP  01.0  2002-01-01
COMP  0005  2008-01-01
COMP  0005  2001-10-20
CTEC  0009  2001-10-20
COMP  0005  2009-10-01
COMP  01.0  2003-07-01
COMP  02.0  2004-01-01
CTEC  0009  2021-09-24
First I want to partition the table on COL1, then partition again on COL2, and sort COL3 in descending order within each partition. Then I want to add a row number.
I wrote:
windowSpec = Window.partitionBy(col("COL1")).partitionBy(col("COl2")).orderBy(desc("COL3"))
TBL = TBL.withColumn(f"RANK", F.row_number().over(windowSpec))
My expected output is this:
COL1  COL2  COL3        RANK
COMP  0005  2009-10-01  1
COMP  0005  2008-08-04  2
COMP  0005  2008-01-01  3
COMP  0005  2001-10-20  4
COMP  0009  2002-01-01  1
COMP  01.0  2003-07-01  1
COMP  01.0  2002-01-01  2
COMP  02.0  2004-01-01  1
CTEC  0009  2021-09-24  1
CTEC  0009  2001-10-20  2
But the output I'm getting is like this:
COL1  COL2  COL3        RANK
COMP  0005  2009-10-01  1
COMP  0005  2008-08-04  2
COMP  0005  2008-01-01  3
COMP  0005  2001-10-20  4
COMP  0009  2002-01-01  2
COMP  01.0  2003-07-01  1
COMP  01.0  2002-01-01  2
COMP  02.0  2004-01-01  1
CTEC  0009  2021-09-24  1
CTEC  0009  2001-10-20  3
Can anyone please help me figure out where I'm making the mistake?
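A likely explanation: the RANK values in the actual output are what you get when rows are numbered within COL2 alone, which fits a second partitionBy call replacing the first one rather than adding to it. The usual way to partition on both columns is a single call, Window.partitionBy("COL1", "COL2"). The sketch below does not use Spark at all. It is a plain-Python emulation of row_number (the helper row_numbers is made up for this illustration) that reproduces both outputs from the sample rows, so the two partitionings can be compared directly.

```python
from itertools import groupby

rows = [
    ("COMP", "0005", "2008-08-04"),
    ("COMP", "0009", "2002-01-01"),
    ("COMP", "01.0", "2002-01-01"),
    ("COMP", "0005", "2008-01-01"),
    ("COMP", "0005", "2001-10-20"),
    ("CTEC", "0009", "2001-10-20"),
    ("COMP", "0005", "2009-10-01"),
    ("COMP", "01.0", "2003-07-01"),
    ("COMP", "02.0", "2004-01-01"),
    ("CTEC", "0009", "2021-09-24"),
]


def row_numbers(rows, key):
    """Hypothetical stand-in for row_number(): number rows 1, 2, ... within
    each partition given by `key`, ordered by COL3 descending."""
    # ISO dates compare correctly as strings; sort by COL3 descending first,
    # then stably by the partition key so each partition stays in that order.
    ordered = sorted(rows, key=lambda r: r[2], reverse=True)
    ordered.sort(key=key)
    ranked = []
    for _, group in groupby(ordered, key=key):
        for n, row in enumerate(group, start=1):
            ranked.append(row + (n,))
    return ranked


by_col1_col2 = row_numbers(rows, key=lambda r: (r[0], r[1]))  # expected output
by_col2_only = row_numbers(rows, key=lambda r: r[1])          # observed output

for row in by_col1_col2:
    print(*row)
```

With the key (COL1, COL2) the CTEC/0009 rows get ranks 1 and 2; with COL2 alone the three 0009 rows share one partition and get ranks 1, 2 and 3, as in the output I'm getting.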
I have the following two tables:
t1:([]sym:`AAPL`GOOG; histo_dates1:(2000.01.01+til 10;2000.01.01+til 10);histo_values1:(til 10;5+til 10));
t2:([]sym:`AAPL`GOOG; histo_dates2:(2000.01.05+til 5;2000.01.06+til 4);histo_values2:(til 5; 2+til 4));
What I want is to sum the histo_values of each symbol across the histo_dates, such that the resulting table would look like this:
t:([]sym:`AAPL`GOOG; histo_dates:(2000.01.01+til 10;2000.01.01+til 10);histo_values:(0 1 2 3 4 6 8 10 12 9;5 6 7 8 9 12 14 16 18 14))
So the resulting dates histo_dates should be the union of histo_dates1 and histo_dates2, and histo_values should be the sum of histo_values1 and histo_values2 across dates.
EDIT:
I insist on the union of the dates, as I want the resulting histo_dates to be the union of both histo_dates1 and histo_dates2.
There are a few ways. One would be to ungroup to remove nesting, join the tables, aggregate on sym/date and then regroup on sym:
q)0!select histo_dates:histo_dates1, histo_values:histo_values1 by sym from select sum histo_values1 by sym, histo_dates1 from ungroup[t1],cols[t1]xcol ungroup[t2]
sym histo_dates histo_values
-------------------------------------------------------------------------------------------------------------------------------------------
AAPL 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 0 1 2 3 4 6 8 10 12 9
GOOG 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 5 6 7 8 9 12 14 16 18 14
A possibly faster way would be to make each row a dictionary and then key the tables on sym and add them:
q)select sym:s, histo_dates:key each v, histo_values:value each v from (1!select s, d!'v from `s`d`v xcol t1)+(1!select s, d!'v from `s`d`v xcol t2)
sym histo_dates histo_values
-------------------------------------------------------------------------------------------------------------------------------------------
AAPL 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 0 1 2 3 4 6 8 10 12 9
GOOG 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 5 6 7 8 9 12 14 16 18 14
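For anyone following along outside q, the dictionary idea can be sketched in plain Python: build one date-to-value mapping per symbol, add the values for matching dates, and read the result back in date order. The helper names days and add_histories are made up for this illustration, and the inputs mirror t1 and t2.

```python
from datetime import date, timedelta


def days(start, n):
    """n consecutive dates starting at `start` (like start+til n in q)."""
    return [start + timedelta(days=i) for i in range(n)]


def add_histories(dates1, values1, dates2, values2):
    """Hypothetical helper: sum two (dates, values) histories over the union
    of their dates; a date missing from one side contributes nothing."""
    totals = {}
    for d, v in list(zip(dates1, values1)) + list(zip(dates2, values2)):
        totals[d] = totals.get(d, 0) + v
    ordered = sorted(totals)
    return ordered, [totals[d] for d in ordered]


aapl_dates, aapl_values = add_histories(
    days(date(2000, 1, 1), 10), list(range(10)),
    days(date(2000, 1, 5), 5), list(range(5)),
)
goog_dates, goog_values = add_histories(
    days(date(2000, 1, 1), 10), list(range(5, 15)),
    days(date(2000, 1, 6), 4), list(range(2, 6)),
)
print(aapl_values)
print(goog_values)
```

Because the mapping is keyed on the date itself, the result covers the union of both date lists and does not depend on the dates being sorted or aligned.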
Another option would be to use a plus join pj:
q)0!`sym xgroup 0!pj[ungroup `sym`histo_dates`histo_values xcol t1;2!ungroup `sym`histo_dates`histo_values xcol t2]
sym histo_dates histo_values
-------------------------------------------------------------------------------------------------------------------------------------------
AAPL 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 0 1 2 3 4 6 8 10 12 9
GOOG 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 5 6 7 8 9 12 14 16 18 14
See here for more on plus joins: https://code.kx.com/v2/ref/pj/
EDIT:
To explicitly make sure the result has the union of the dates, you could use a union join:
q)0!`sym xgroup select sym,histo_dates,histo_values:hv1+hv2 from 0^uj[2!ungroup `sym`histo_dates`hv1 xcol t1;2!ungroup `sym`histo_dates`hv2 xcol t2]
sym histo_dates histo_values
-------------------------------------------------------------------------------------------------------------------------------------------
AAPL 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 0 1 2 3 4 6 8 10 12 9
GOOG 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 5 6 7 8 9 12 14 16 18 14
Another way:
// rename the columns to be common names, ungroup the tables, and place the key on `sym and `histo_dates
q){2!ungroup `sym`histo_dates`histo_values xcol x} each (t1;t2)
// add them together (or use pj in place of +), group on `sym
`sym xgroup (+) . {2!ungroup `sym`histo_dates`histo_values xcol x} each (t1;t2)
// and to test this matches t, remove the key from the resulting table
q)t~0!`sym xgroup (+) . {2!ungroup `sym`histo_dates`histo_values xcol x} each (t1;t2)
1b
Another possible way, using functional amend:
//Column join the histo_dates* columns and get the distinct dates - drop idx
//Using a functional apply use the idx to determine which values to plus
//Join the two tables using sym as the key - Find the idx of common dates
(enlist `idx) _select sym,histo_dates:distinct each (histo_dates1,'histo_dates2),
histovalues:{@[x;z;+;y]}'[histo_values1;histo_values2;idx],idx from
update idx:(where each histo_dates1 in' histo_dates2) from ((1!t1) uj 1!t2)
One possible problem with this approach is that computing idx depends on the date columns being sorted, which is usually the case.
I have a dataframe that looks like this:
user_id val date
1 10 2015-02-01
1 11 2015-01-01
2 12 2015-03-01
2 13 2015-02-01
3 14 2015-03-01
3 15 2015-04-01
I need to run a function that calculates (let's say) the sum of vals chronologically by the dates. If a user has a more recent date, use that date, but if not, keep the older date.
For example. If I run the function with the date 2015-03-15, then the table will be:
user_id val date
1 10 2015-02-01
2 12 2015-03-01
3 14 2015-03-01
Giving me a sum of 36.
If I run the function with the date 2015-04-15, then the table will be:
user_id val date
1 10 2015-02-01
2 12 2015-03-01
3 15 2015-04-01
(User 3's row was replaced with a more recent date).
I know this is fairly esoteric, but I thought I could bounce it off all of you, as I have been trying to think of a simple way of doing this.
Try this:
In [36]: df.loc[df.date <= '2015-03-15']
Out[36]:
user_id val date
0 1 10 2015-02-01
1 1 11 2015-01-01
2 2 12 2015-03-01
3 2 13 2015-02-01
4 3 14 2015-03-01
In [39]: df.loc[df.date <= '2015-03-15'].sort_values('date').groupby('user_id').agg({'date':'last', 'val':'last'}).reset_index()
Out[39]:
user_id date val
0 1 2015-02-01 10
1 2 2015-03-01 12
2 3 2015-03-01 14
or:
In [40]: df.loc[df.date <= '2015-03-15'].sort_values('date').groupby('user_id').last().reset_index()
Out[40]:
user_id val date
0 1 10 2015-02-01
1 2 12 2015-03-01
2 3 14 2015-03-01
In [41]: df.loc[df.date <= '2015-04-15'].sort_values('date').groupby('user_id').last().reset_index()
Out[41]:
user_id val date
0 1 10 2015-02-01
1 2 12 2015-03-01
2 3 15 2015-04-01
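The steps above stop at the filtered table. To get the sum the question asks for, one option is a small function like the sketch below. It is written in plain Python rather than pandas so that it runs without any dependencies; the function name sum_latest_vals is made up, and the records are the rows from the question.

```python
from datetime import date

# (user_id, val, date) rows from the question.
records = [
    (1, 10, date(2015, 2, 1)),
    (1, 11, date(2015, 1, 1)),
    (2, 12, date(2015, 3, 1)),
    (2, 13, date(2015, 2, 1)),
    (3, 14, date(2015, 3, 1)),
    (3, 15, date(2015, 4, 1)),
]


def sum_latest_vals(records, as_of):
    """Hypothetical helper: for each user keep the row with the most recent
    date on or before `as_of`, then sum the vals of the kept rows."""
    latest = {}  # user_id -> (date, val)
    for user_id, val, day in records:
        if day <= as_of and (user_id not in latest or day > latest[user_id][0]):
            latest[user_id] = (day, val)
    return sum(val for _, val in latest.values())


print(sum_latest_vals(records, date(2015, 3, 15)))  # 36
print(sum_latest_vals(records, date(2015, 4, 15)))  # 37
```

This is the same filter, keep-latest-per-user logic as the groupby version, followed by the sum.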
In SQL, one can do
SELECT * from tbl ORDER BY col1, col2 DESC
In KDB, one can do
`col1 xasc select from tbl
or
`col2 xdesc select from tbl
But how does one sort by col1 ascending then by col2 descending in KDB/Q?
Use two sorts: apply the secondary (descending) sort first, then the primary (ascending) one.
Create example data:
q)show tbl:([]a:10?10;b:10?10;c:10?10)
a b c
-----
8 4 8
1 9 1
7 2 9
2 7 5
4 0 4
5 1 6
4 9 6
2 2 1
7 1 8
8 8 5
Do sorting:
q)`a xasc `b xdesc tbl
a b c
-----
1 9 1
2 7 5
2 2 1
4 9 6
4 0 4
5 1 6
7 2 9
7 1 8
8 8 5
8 4 8
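The two-sort trick works when the second sort keeps the existing order among rows that tie on its key (a stable sort). The same pattern can be sketched in plain Python, whose sorted() is guaranteed stable; the function name sort_a_asc_b_desc is made up, and the rows are the example table above as (a, b, c) tuples.

```python
# Rows of the example table as (a, b, c) tuples.
tbl = [
    (8, 4, 8), (1, 9, 1), (7, 2, 9), (2, 7, 5), (4, 0, 4),
    (5, 1, 6), (4, 9, 6), (2, 2, 1), (7, 1, 8), (8, 8, 5),
]


def sort_a_asc_b_desc(rows):
    """Hypothetical helper: sort by a ascending, then by b descending
    within equal a, using two stable sorts."""
    by_b = sorted(rows, key=lambda r: r[1], reverse=True)  # secondary key first
    return sorted(by_b, key=lambda r: r[0])                # primary key last


for row in sort_a_asc_b_desc(tbl):
    print(*row)
```

The order of the two calls matters: the last sort applied decides the primary ordering, which is why `b xdesc runs before `a xasc in the q expression.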