Stata merge with multiple match variables

I am having difficulty combining datasets for a project. Our primary dataset is organized by individual judges. It is an attribute dataset.
judge
j | x | y | z
----|----|----|----
1 | 2 | 3 | 4
2 | 5 | 6 | 7
The second dataset is a case database. Each observation is a case and judges can appear in one of three variables.
case
case | j1 | j2 | j3 | year
-----|----|----|----|-----
1 | 1 | 2 | 3 | 2002
2 | 2 | 3 | 1 | 1997
We would like to merge the case database into the attribute database, matching by judge. For each case in which a judge appears in j1, j2, or j3, an observation for that case would be added, creating a dataset like the one below.
combined
j | x | y | z | case | year
---|----|----|----|-------|--------
1 | 2 | 3 | 4 | 1 | 2002
1 | 2 | 3 | 4 | 2 | 1997
2 | 5 | 6 | 7 | 1 | 2002
2 | 5 | 6 | 7 | 2 | 1997
My best guess is to use
rename j1 j
merge 1:m j using case
rename j j1
rename j2 j
merge 1:m j using case
However, I am unsure that this will work, especially since the merging dataset has three possible variables that the j identification can occur in.

Your examples are clear, but even better would be to present them as code that does not require edits to remove the scaffolding. See dataex from SSC (ssc inst dataex).
It's a case of the missing reshape, I think.
clear
input j x y z
1 2 3 4
2 5 6 7
end
save judge
clear
input case j1 j2 j3 year
1 1 2 3 2002
2 2 3 1 1997
end
reshape long j , i(case) j(which)
merge m:1 j using judge
list
+-------------------------------------------------------+
| case which j year x y z _merge |
|-------------------------------------------------------|
1. | 1 1 1 2002 2 3 4 matched (3) |
2. | 2 3 1 1997 2 3 4 matched (3) |
3. | 2 1 2 1997 5 6 7 matched (3) |
4. | 1 2 2 2002 5 6 7 matched (3) |
5. | 2 2 3 1997 . . . master only (1) |
|-------------------------------------------------------|
6. | 1 3 3 2002 . . . master only (1) |
+-------------------------------------------------------+
drop if _merge < 3
list
+---------------------------------------------------+
| case which j year x y z _merge |
|---------------------------------------------------|
1. | 1 1 1 2002 2 3 4 matched (3) |
2. | 2 3 1 1997 2 3 4 matched (3) |
3. | 2 1 2 1997 5 6 7 matched (3) |
4. | 1 2 2 2002 5 6 7 matched (3) |
+---------------------------------------------------+
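For readers who want to cross-check the reshape-then-merge logic outside Stata, here is a minimal Python sketch using only the standard library. The data are the question's example hard-coded; the names `judges`, `cases`, `long_rows`, and `combined` are illustrative and not part of the original answer.

```python
# Judge attributes keyed by judge id (the "judge" dataset).
judges = {
    1: {"x": 2, "y": 3, "z": 4},
    2: {"x": 5, "y": 6, "z": 7},
}

# One row per case, with judges spread over j1..j3 (the "case" dataset).
cases = [
    {"case": 1, "j1": 1, "j2": 2, "j3": 3, "year": 2002},
    {"case": 2, "j1": 2, "j2": 3, "j3": 1, "year": 1997},
]

# Step 1 (like reshape long): one row per (case, judge slot).
long_rows = [
    {"case": c["case"], "which": k, "j": c[f"j{k}"], "year": c["year"]}
    for c in cases
    for k in (1, 2, 3)
]

# Step 2 (like merge m:1 followed by dropping unmatched rows): attach the
# judge attributes and skip judges with no attribute row (judge 3 here).
combined = [
    {**row, **judges[row["j"]]}
    for row in long_rows
    if row["j"] in judges
]

combined.sort(key=lambda r: (r["j"], r["case"]))
for r in combined:
    print(r["j"], r["x"], r["y"], r["z"], r["case"], r["year"])
```

Sorting by judge and case gives the same four rows as the "combined" table in the question.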

kdb union join (with plus join)

I have been stuck on this for a while now and cannot come up with a solution; any help would be appreciated.
I have 2 tables like
q)x
a b c d
--------
1 x 10 1
2 y 20 1
3 z 30 1
q)y
a b| c d
---| ----
1 x| 1 10
3 h| 2 20
I would like to sum the common columns and append the new rows. The expected result is
a b c d
--------
1 x 11 11
2 y 20 1
3 z 30 1
3 h 2 20
pj only seems to update (1,x) and doesn't insert the new (3,h). I assume there is a way to do some sort of union plus join in kdb.
You can take advantage of the plus (+) operator here by simply keying x and adding the table y to get the desired table:
q)(2!x)+y
a b| c d
---| -----
1 x| 11 11
2 y| 20 1
3 z| 30 1
3 h| 2 20
The same "plus if there's a matching key, insert if not" behaviour works for dictionaries too:
q)(`a`b!1 2)+`a`c!10 30
a| 11
b| 2
c| 30
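The "add on a matching key, insert otherwise" behaviour can be sketched in Python for comparison. This is only an illustration of the semantics, not how q implements it; `plus_join` and `add_rows` are hypothetical helper names, and the keyed tables are modelled as dictionaries from key tuples to value tuples.

```python
def plus_join(left, right, add=lambda a, b: a + b):
    """Add values where keys match; insert keys found only in `right`."""
    out = dict(left)  # copy so the inputs are left untouched
    for key, val in right.items():
        out[key] = add(out[key], val) if key in out else val
    return out


# Dictionary case, mirroring (`a`b!1 2)+`a`c!10 30
d = plus_join({"a": 1, "b": 2}, {"a": 10, "c": 30})
print(d)


def add_rows(r, s):
    """Element-wise sum of two value rows (columns c and d)."""
    return tuple(p + q for p, q in zip(r, s))


# Keyed-table case: keys are (a, b), values are (c, d).
x = {(1, "x"): (10, 1), (2, "y"): (20, 1), (3, "z"): (30, 1)}
y = {(1, "x"): (1, 10), (3, "h"): (2, 20)}
t = plus_join(x, y, add_rows)
for key, val in t.items():
    print(*key, *val)
```

Because Python dictionaries keep insertion order, the new key (3, h) lands at the end, as in the q result.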
got it :)
q) (x pj y), 0!select from y where not ([]a;b) in key 2!x
a b c d
--------
1 x 11 11
2 y 20 1
3 z 30 1
3 h 2 20
Always open for a better implementation :D I am sure there is one.

summarise (avg) table (keyed) for each row

Given a keyed table, e.g.:
q)\S 7 / seed random numbers for reproducibility
q)v:flip (neg[d 0]?`1)!#[;prd[d]?12] d:4 6 / 4 cols 6 rows
q)show kt:([]letter:d[1]#.Q.an)!v
letter| c g b e
------| ----------
a | 11 0 3 9
b | 11 8 10 0
c | 7 2 2 3
d | 8 4 9 6
e | 0 0 5 0
f | 1 0 0 11
How to calculate an average for each row --- e.g. (c+g+b+e)%4 --- for any number of columns?
Following on from your own solution, note that you have to be a little careful with null handling. Your approach won't ignore nulls in the way that avg normally would.
q).[`kt;("a";`g);:;0N];
q)update av:avg flip value kt from kt
letter| c  g b  e  av
------| ---------------
a     | 11   3  9
b     | 11 8 10 0  7.25
c     | 7  2 2  3  3.5
d     | 8  4 9  6  6.75
e     | 0  0 5  0  1.25
f     | 1  0 0  11 3
To make it ignore nulls you have to avg each row rather than averaging the flip.
q)update av:avg each value kt from kt
letter| c  g b  e  av
------| -------------------
a     | 11   3  9  7.666667
b     | 11 8 10 0  7.25
c     | 7  2 2  3  3.5
d     | 8  4 9  6  6.75
e     | 0  0 5  0  1.25
f     | 1  0 0  11 3
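The difference between the two null behaviours can be illustrated with a small Python sketch, using `None` in place of a q null. The function names `avg_propagate` and `avg_skip` are made up for this illustration; they mimic the two results shown above rather than q's internals.

```python
# Rows a and b of the table, with g set to null (None) in row a.
rows = {
    "a": [11, None, 3, 9],
    "b": [11, 8, 10, 0],
}


def avg_propagate(vals):
    """Plain sum / count: a single null makes the whole result null."""
    if any(v is None for v in vals):
        return None
    return sum(vals) / len(vals)


def avg_skip(vals):
    """Average over the non-null values only."""
    present = [v for v in vals if v is not None]
    return sum(present) / len(present) if present else None


for letter, vals in rows.items():
    print(letter, avg_propagate(vals), avg_skip(vals))
```

For row a the first function returns a null while the second averages the three remaining values, (11+3+9)%3 in q terms.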
Solution 1: q-sql
q)update av:avg flip value kt from kt
letter| c g b e av
------| ---------------
a | 11 0 3 9 5.75
b | 11 8 10 0 7.25
c | 7 2 2 3 3.5
d | 8 4 9 6 6.75
e | 0 0 5 0 1.25
f | 1 0 0 11 3
Solution 2: functional q-sql
tl;dr:
q)![kt;();0b;](1#`av)!enlist(avg;)enlist,cols[`kt]except cols key`kt
letter| c g b e av
------| ---------------
a | 11 0 3 9 5.75
b | 11 8 10 0 7.25
c | 7 2 2 3 3.5
d | 8 4 9 6 6.75
e | 0 0 5 0 1.25
f | 1 0 0 11 3
let's start by looking at the parse tree of a non-general solution:
q)parse"update av:avg (c;g;b;e) from kt"
!
`kt
()
0b
(,`av)!,(avg;(enlist;`c;`g;`b;`e))
(note that q is a wrapper implemented in k, so the , prefix operator in the above expression is the same as enlist keyword in q)
so all the below are equivalent (verify with ~). relying on projection: (x;y)~(x;)y, we can further improve the readability by reducing the distance between parens:
q)k)(!;`kt;();0b;(,`av)!,(avg;(enlist;`c;`g;`b;`e)))
q)(!;`kt;();0b;(enlist`av)!enlist(avg;(enlist;`c;`g;`b;`e)))
q)(!;`kt;();0b;(1#`av)!enlist(avg;(enlist;`c;`g;`b;`e)))
q)(!;`kt;();0b;)(1#`av)!enlist(avg;)(enlist;`c;`g;`b;`e)
let's evaluate the parse tree to check:
q)eval(!;`kt;();0b;)(1#`av)!enlist(avg;)(enlist;`c;`g;`b;`e)
letter| c g b e av
------| ---------------
a | 11 0 3 9 5.75
b | 11 8 10 0 7.25
c | 7 2 2 3 3.5
d | 8 4 9 6 6.75
e | 0 0 5 0 1.25
f | 1 0 0 11 3
(enlist;`c;`g;`b;`e) in the general case is:
q)enlist,cols[`kt]except cols key`kt
enlist
`c
`g
`b
`e
so let's plug in and check:
q)eval(!;`kt;();0b;(1#`av)!enlist(avg;)enlist,cols[`kt]except cols key`kt)
letter| c g b e av
------| ---------------
a | 11 0 3 9 5.75
b | 11 8 10 0 7.25
c | 7 2 2 3 3.5
d | 8 4 9 6 6.75
e | 0 0 5 0 1.25
f | 1 0 0 11 3
also:
q)![`kt;();0b;(1#`av)!enlist(avg;)enlist,cols[`kt]except cols key`kt]
q)![ kt;();0b;](1#`av)!enlist(avg;)enlist,cols[`kt]except cols key`kt

Using a growth formula for grouped observations

I have a dataset which is shown below:
clear
input year price growth id
2008 5 -0.444 1
2009 . . 1
2010 7 -0.222 1
2011 9 0 1
2011 8 -0.111 1
2012 9 0 1
2013 11 0.22 1
2012 10 0 2
2013 12 0.2 2
2013 . . 2
2014 13 0.3 2
2015 17 0.7 2
2015 16 0.6 2
end
I want to generate variable growth which is the growth of price. The growth formula is:
growth = (price of second year - price of base year) / price of base year
The base year is always 2012.
How can I generate this growth variable for each group of observation (by id)?
The base price can be picked out directly by egen:
bysort id: egen price_b = total(price * (year == 2012))
generate wanted = (price - price_b) / price_b
Note that total() relies on the assumption that, for each id, there is exactly one observation with year == 2012.
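The same base-year calculation can be checked outside Stata with a short Python sketch. It hard-codes the question's data as (year, price, id) tuples, uses `None` for missing prices, and, like the egen approach, assumes one base-year observation per id; `BASE_YEAR`, `base`, and `wanted` are illustrative names.

```python
# (year, price, id) rows from the question; None marks a missing price.
rows = [
    (2008, 5, 1), (2009, None, 1), (2010, 7, 1), (2011, 9, 1),
    (2011, 8, 1), (2012, 9, 1), (2013, 11, 1),
    (2012, 10, 2), (2013, 12, 2), (2013, None, 2),
    (2014, 13, 2), (2015, 17, 2), (2015, 16, 2),
]

BASE_YEAR = 2012

# Pick out the base-year price for each id
# (assumes exactly one non-missing base-year row per id).
base = {}
for year, price, gid in rows:
    if year == BASE_YEAR and price is not None:
        base[gid] = price

# Growth relative to the base year; missing prices stay missing.
wanted = [
    None if price is None else (price - base[gid]) / base[gid]
    for year, price, gid in rows
]

for (year, price, gid), w in zip(rows, wanted):
    print(gid, year, price, w)
```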
The following works for me:
bysort id: generate obs = _n
generate double wanted = .
levelsof id, local(ids)
foreach x of local ids {
summarize obs if id == `x' & year == 2012, meanonly
bysort id: replace wanted = (price - price[`=r(min)']) / ///
price[`=r(min)'] if id == `x'
}
If the id values are consecutive, then the following will be faster:
forvalues i = 1 / 2 {
summarize obs if id == `i' & year == 2012, meanonly
bysort id: replace wanted = (price - price[`=r(min)']) / ///
price[`=r(min)'] if id == `i'
}
Results:
list, sepby(id)
+-----------------------------------------------+
| year price growth id obs wanted |
|-----------------------------------------------|
1. | 2008 5 -.444 1 1 -.44444444 |
2. | 2009 . . 1 2 . |
3. | 2010 7 -.222 1 3 -.22222222 |
4. | 2011 9 0 1 4 0 |
5. | 2011 8 -.111 1 5 -.11111111 |
6. | 2012 9 0 1 6 0 |
7. | 2013 11 .22 1 7 .22222222 |
|-----------------------------------------------|
8. | 2012 10 0 2 1 0 |
9. | 2013 12 .2 2 2 .2 |
10. | 2013 . . 2 3 . |
11. | 2014 13 .3 2 4 .3 |
12. | 2015 17 .7 2 5 .7 |
13. | 2015 16 .6 2 6 .6 |
+-----------------------------------------------+

Value of column, based on function over another column in Matlab table

I'm interested in the value of result that is in the same row as the min value of each column (and I have many columns, so I would like to loop over them, or do rowfun but I do not know how to get 'result' then).
Table A
+----+------+------+----+------+------+--------+
| x1 | x2 | x3 | x4 | x5 | x6 | result |
+----+------+------+----+------+------+--------+
| 1 | 4 | 10 | 3 | 12 | 2 | 8 |
| 10 | 2 | 8 | 1 | 12 | 3 | 10 |
| 5 | 10 | 5 | 4 | 2 | 10 | 12 |
+----+------+------+----+------+------+--------+
Solution
8 10 12 10 12 8
I know that I can apply rowfun, but then I don't know how to get result.
And then, I can do this, but cannot loop over all the columns:
A(cell2mat(A.x1) == min(cell2mat(A.x1)), 7)
and I have tried several ways of making this into a variable but I can't make it work, so that:
A(cell2mat(variable) == min(cell2mat(variable)), 7)
Thank you!
Assuming your data is homogeneous, you can use table2array and the second output of min to index your results:
% Set up table
x1 = [1 10 5];
x2 = [4 2 10];
x3 = [10 8 5];
x4 = [3 1 4];
x5 = [12 12 2];
x6 = [2 3 10];
result = [8 10 12];
t = table(x1.', x2.', x3.', x4.', x5.', x6.', result.', ...
'VariableNames', {'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'result'});
% Convert
A = table2array(t);
% When passed a matrix, min finds minimum of each column by default
% Exclude the results column, assumed to be the last
[~, minrow] = min(A(:, 1:end-1));
solution = t.result(minrow)'
Which returns:
solution =
8 10 12 10 12 8
From the documentation for min:
M = min(A) returns the smallest elements of A.
<snip>
If A is a matrix, then min(A) is a row vector containing the minimum value of each column.
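For comparison, the "row index of each column's minimum, then look up result" idea can be sketched in plain Python. The variable names are illustrative, indices are zero-based here, and ties resolve to the first minimum.

```python
# Columns x1..x6 and the result column from the question's table.
columns = {
    "x1": [1, 10, 5],
    "x2": [4, 2, 10],
    "x3": [10, 8, 5],
    "x4": [3, 1, 4],
    "x5": [12, 12, 2],
    "x6": [2, 3, 10],
}
result = [8, 10, 12]

solution = []
for name, vals in columns.items():
    # Row index of the (first) smallest value in this column.
    minrow = min(range(len(vals)), key=vals.__getitem__)
    solution.append(result[minrow])

print(solution)
```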

YUI 3 Chart Axis Label Positioning?

I have created a chart using YUI and want to display axis labels. The x axis is fine, but the y axis label appears inside of the data.
Here is what is happening:
10 |
9 |
8 |
7 l |
6 a |
5 b | CHART
4 e |
3 l |
2 |
1 |
0 |_____________________________
Here is what I want to happen:
10 |
9 |
8 |
l 7 |
a 6 |
b 5 | CHART
e 4 |
l 3 |
2 |
1 |
0 |_____________________________
Here is my code for the chart axes:
var chartaxes = {
timeelapsed:{
position:"bottom",
type:"category",
title:"label"
},
kWh:{
position:"left",
type:"numeric",
title:"label",
}
};
Is there any way to fix this?
To display the title correctly, you need to associate each axis with its series using the keys property.
var chartaxes = {
timeelapsed:{
position:"bottom",
type:"category",
title:"Time Elapsed (minutes)",
keys: ["category"],
and so on.