Repeat a specific value per firm to all years

I have panel data ranging from 1917 to 1922, with various variables (for example, leverage) for 200 firms.
It looks something like this:
Year ID Leverage
1917 1 0.1
1918 1 0.2
1919 1 0.3
1917 2 0.4
1918 2 0.5
1917 3 0.6
1918 3 0.7
1919 3 0.8
1920 3 0.9
....
I want to copy the 1917 values to all other years (1918, 1919, ...) for the variables per firm ID. As my example shows, not all years are present for every firm, so I cannot assume the 1917 value appears every X rows. The result should be something like:
Year ID Leverage
1917 1 0.1
1918 1 0.1
1919 1 0.1
1917 2 0.4
1918 2 0.4
1917 3 0.6
1918 3 0.6
1919 3 0.6
1920 3 0.6
....

The following works for me:
clear
input year id leverage
1917 1 0.1
1918 1 0.2
1919 1 0.3
1917 2 0.4
1918 2 0.5
1917 3 0.6
1918 3 0.7
1919 3 0.8
1920 3 0.9
end
gen leverage1917 = leverage if year == 1917
bysort id: egen min = min(leverage1917)
replace leverage = min
drop min leverage1917
list, sepby(id)
+----------------------+
| year id leverage |
|----------------------|
1. | 1917 1 .1 |
2. | 1918 1 .1 |
3. | 1919 1 .1 |
|----------------------|
4. | 1917 2 .4 |
5. | 1918 2 .4 |
|----------------------|
6. | 1917 3 .6 |
7. | 1918 3 .6 |
8. | 1919 3 .6 |
9. | 1920 3 .6 |
+----------------------+
EDIT (NJC): This could be simplified to
generate leverage1917 = leverage if year == 1917
bysort id (leverage1917) : replace leverage1917 = leverage1917[1]
thus cutting out the egen call and the generation of another variable you then need to drop. This works properly even if there is no value for 1917 for some values of id.

Borrowing @Cybernike's helpful data example, here are two ways to do it in one line each:
clear
input year id leverage
1917 1 0.1
1918 1 0.2
1919 1 0.3
1917 2 0.4
1918 2 0.5
1917 3 0.6
1918 3 0.7
1919 3 0.8
1920 3 0.9
end
egen wanted1 = mean(cond(year == 1917, leverage, .)), by(id)
egen wanted2 = mean(leverage / (year == 1917)), by(id)
list, sepby(id)
+------------------------------------------+
| year id leverage wanted1 wanted2 |
|------------------------------------------|
1. | 1917 1 .1 .1 .1 |
2. | 1918 1 .2 .1 .1 |
3. | 1919 1 .3 .1 .1 |
|------------------------------------------|
4. | 1917 2 .4 .4 .4 |
5. | 1918 2 .5 .4 .4 |
|------------------------------------------|
6. | 1917 3 .6 .6 .6 |
7. | 1918 3 .7 .6 .6 |
8. | 1919 3 .8 .6 .6 |
9. | 1920 3 .9 .6 .6 |
+------------------------------------------+
For detailed discussion of both methods, see Sections 9 and 10 of this paper.
I don't overwrite the original data, contrary to your request: often you decide later that you need the original values after all, or someone asks to see them.
This isn't necessarily better than @Cybernike's solution. The division method behind wanted2 has struck some experienced users as too tricksy, and I tend to recommend the cond() device behind wanted1.
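For readers working outside Stata, the same broadcast logic is easy to sketch in plain Python. This is a minimal illustration (hypothetical data mirroring the example above): build a per-id lookup of the 1917 value, then overwrite every row of that firm with it.

```python
# Panel rows as (year, id, leverage) tuples, mirroring the example data.
rows = [
    (1917, 1, 0.1), (1918, 1, 0.2), (1919, 1, 0.3),
    (1917, 2, 0.4), (1918, 2, 0.5),
    (1917, 3, 0.6), (1918, 3, 0.7), (1919, 3, 0.8), (1920, 3, 0.9),
]

# Step 1: collect the 1917 value per firm id (the analogue of
# `gen leverage1917 = leverage if year == 1917`).
base = {firm: lev for year, firm, lev in rows if year == 1917}

# Step 2: broadcast it to every row of that firm (the analogue of the
# `bysort id` + `replace` step); firms with no 1917 row get None (missing).
wanted = [(year, firm, base.get(firm)) for year, firm, _ in rows]
```

As in the Stata versions, a firm with no 1917 observation simply ends up with a missing value rather than an error.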

Related

Data analysis in Matlab

I have a time vector in MATLAB that does not have a consistent sampling time, e.g. t = [0.1 0.2 0.3 0.4 0.5 0.6 0.7 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2 2.9 3 3.1], and another time-based vector a = [2 5 2 4 5 7 8 0 10 1 0 25 6 14 5 2 7 98]. When I plot(t,a), a straight line connects the two points on either side of each larger sampling gap. How can I remove these gaps where the sampling time jumps to a larger value? I know that putting NaN between 0.7 and 1.3, and between 2 and 2.9, in both t and a would help, but how do I detect where the sampling time changes?
You can try the following code; here are two approaches:
Approach 1: adding NaN
clc;
clear;
close all;
t = [0.1 0.2 0.3 0.4 0.5 0.6 0.7 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2 2.9 3 3.1];
a = [2 5 2 4 5 7 8 0 10 1 0 25 6 14 5 2 7 98];
dt = diff(t);
idx = find(dt > mode(dt));
tC = mat2cell(t',[idx(1),diff([idx,length(t)])]);
aC = mat2cell(a',[idx(1),diff([idx,length(t)])]);
nadd = dt(idx)/mode(dt);
T = [];
A = [];
for i = 1:length(nadd)
T = [T; tC{i};ones(int32(nadd(i)),1)*nan];
A = [A; aC{i};ones(int32(nadd(i)),1)*nan];
end
T = [T;tC{end}];
A = [A;aC{end}];
plot(T,A)
Approach 2: dividing the vector into intervals
clc;
clear;
t = [0.1 0.2 0.3 0.4 0.5 0.6 0.7 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2 2.9 3 3.1];
a = [2 5 2 4 5 7 8 0 10 1 0 25 6 14 5 2 7 98];
dt = diff(t);
idx = find(dt > mode(dt));
tC = mat2cell(t',[idx(1),diff([idx,length(t)])]);
aC = mat2cell(a',[idx(1),diff([idx,length(t)])]);
hold on;
arrayfun(@(k) plot(tC{k},aC{k}),1:(length(idx)+1));
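The NaN-insertion idea from Approach 1 is also compact in Python with NumPy; here is a rough sketch (assuming NumPy is available) that flags any step larger than the typical one as a gap and inserts a single NaN there, which is enough to break the line in most plotting libraries:

```python
import numpy as np

t = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 1.3, 1.4, 1.5, 1.6,
              1.7, 1.8, 1.9, 2.0, 2.9, 3.0, 3.1])
a = np.array([2, 5, 2, 4, 5, 7, 8, 0, 10, 1, 0, 25, 6, 14, 5, 2, 7, 98],
             dtype=float)

dt = np.diff(t)
typical = np.median(dt)                 # robust estimate of the usual step
gaps = np.where(dt > 1.5 * typical)[0]  # indices where a large gap starts

# Insert one NaN after each gap start; plot(T, A) then shows broken lines.
T = np.insert(t, gaps + 1, np.nan)
A = np.insert(a, gaps + 1, np.nan)
```

Using the median step rather than the mode sidesteps floating-point noise in `diff(t)`.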

Can I do several calculations in one calculated field?

Can I use a calculated field for multiple simple calculations? I can't find it in the documentation.
SUM([Au 1])/SUM([AU])
SUM([Be 1])/SUM([BE])
This gives a syntax error.
You can reference calculations from other calculated fields, but you cannot have multiple results with the same calculated field.
The example below works:
Col 1 | Col 2 | Col 3
1.00 | 1.00 | 2.00
Calculated Field: Ratio of Col 1 & Col 2
SUM([Col 1]) / SUM([Col 2]) = 1
Calculated Field: Ratio of Col 1 & Col 3
[Ratio of Col 1 & Col 2] / SUM([Col 3]) = 0.5
The example below will not work:
Col 1 | Col 2 | Col 3
1.00 | 1.00 | 2.00
Calculated Field: Ratio of Col 1 & Col 2 & Col 3
SUM([Col 1]) / SUM([Col 2])
SUM([Col 1]) / SUM([Col 3]) = Error
You could do this instead:
Col 1 | Col 2 | Col 3
1.00 | 1.00 | 2.00
Calculated Field: Ratio of Col 1 & Col 2 & Col 3
(SUM([Col 1]) / SUM([Col 2])) / SUM([Col 3]) = 0.5
Hope this helps.

(q/kdb+) Search column amounts

Can someone help me with the below?
nColss:1 3 4 4.5;
aa:([]amount:250000+500000*5?10;n1M:0.5*5?4;n3M:2+0.5*5?4;n4M:4+0.5*5?4;n4.5M:6+0.5*5?4);
aa:update nRng:{[l;n] (min l | l l bin n),(l l binr n & max l)}[nColss] each aa[`amount]%1000000 from aa;
aa:update nRng2:{`$("n",'string x),'"M"} each aa[`nRng] from aa;
amount n1M n3M n4M n4.5M nRng nRng2
250000 1.5 2 4 7 1 1f `n1M`n1M
2250000 0.5 2 5 6.5 1 3f `n1M`n3M
4250000 1.5 2.5 5 6 4 4.5 `n4M`n4.5M
250000 1 3.5 4.5 7.5 1 1f `n1M`n1M
1250000 1 2.5 4 7 1 3f `n1M`n3M
How can I generate a column nValue containing, for each row, the values of the columns specified in the nRng2 column?
Something like this
nValue
1.5 1.5
0.5 2
5 6
1 1
1 2.5
I was trying something like
aa[aa[`nRng2]]
that generates
index value
0 (1.5 0.5 1.5 1 1;1.5 0.5 1.5 1 1)
1 (1.5 0.5 1.5 1 1;2 2 2.5 3.5 2.5)
2 (4 5 5 4.5 4;7 6.5 6 7.5 7)
3 (1.5 0.5 1.5 1 1;1.5 0.5 1.5 1 1)
4 (1.5 0.5 1.5 1 1;2 2 2.5 3.5 2.5)
then I would need to take the diagonal of this matrix, but I am stuck there.
I get slightly different values in the aa table when I enter your example code, but something like this seems to work:
q)aa[`nValue]:{x x`nRng2} each aa
q)aa
amount n1M n3M n4M n4.5M nRng nRng2 nValue
-----------------------------------------------------
4750000 0.5 2 4.5 6 4.5 4.5 n4.5M n4.5M 6 6
1250000 0 3.5 5 6 1 3 n1M n3M 0 3.5
3750000 0.5 2.5 5.5 7 3 4 n3M n4M 2.5 5.5
250000 1 3 5 6.5 1 1 n1M n1M 1 1
750000 0 3 4.5 7 1 1 n1M n1M 0 0
To give a quick explanation of what this is doing: `each aa` essentially passes each record of the table into the lambda function as a dictionary (a table in kdb+ is simply a list of dictionaries). Within the lambda we index into the record with nRng2 to get the column names, and then index into the dictionary again using those names. Assigning the result with index notation adds the new column.
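The "table as a list of dictionaries" picture translates directly to Python. This hypothetical sketch, with a couple of rows mirroring the table above, mimics what ``{x x`nRng2} each aa`` does using plain dicts:

```python
# Each q record behaves like a dict; hypothetical rows mirroring the table.
aa = [
    {"amount": 250000,  "n1M": 1.5, "n3M": 2.0, "nRng2": ["n1M", "n1M"]},
    {"amount": 2250000, "n1M": 0.5, "n3M": 2.0, "nRng2": ["n1M", "n3M"]},
]

# For each record x: look up x["nRng2"] to get the column names, then
# index back into the same record with those names.
nValue = [[row[col] for col in row["nRng2"]] for row in aa]
```

The double indexing per row is exactly the "diagonal" the asker was trying to extract from `aa[aa[`nRng2]]`.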

Fit a piecewise regression in matlab and find change point

In matlab, I want to fit a piecewise regression and find where on the x-axis the first change-point occurs. For example, for the following data, the output might be changepoint=20 (I don't actually want to plot it, just want the change point).
data = [1 4 4 3 4 0 0 4 5 4 5 2 5 10 5 1 4 15 4 9 11 16 23 25 24 17 31 42 35 45 49 54 74 69 63 46 35 31 27 15 10 5 10 4 2 4 2 2 3 5 2 2];
x = 1:52;
plot(x,data,'.')
If you have the Signal Processing Toolbox, you can directly use the findchangepts function (see https://www.mathworks.com/help/signal/ref/findchangepts.html for documentation):
data = [1 4 4 3 4 0 0 4 5 4 5 2 5 10 5 1 4 15 4 9 11 16 23 25 24 17 31 42 35 45 49 54 74 69 63 46 35 31 27 15 10 5 10 4 2 4 2 2 3 5 2 2];
x = 1:52;
ipt = findchangepts(data);
x_cp = x(ipt);
data_cp = data(ipt);
plot(x,data,'.',x_cp,data_cp,'o')
The index of the change point in this case is 22.
Plot of data and its change point circled in red:
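If the Signal Processing Toolbox is unavailable, the core idea behind findchangepts for a single changepoint, i.e. choosing the split that minimizes the residual error around the two segment means, is easy to sketch in Python with NumPy. This is an illustrative stand-in, not MATLAB's exact algorithm:

```python
import numpy as np

def one_changepoint(y):
    """Return the split index k that minimizes the total squared
    deviation of y[:k] and y[k:] from their respective means."""
    y = np.asarray(y, dtype=float)
    best_k, best_cost = 1, np.inf
    for k in range(1, len(y)):
        left, right = y[:k], y[k:]
        cost = ((left - left.mean()) ** 2).sum() \
             + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

data = [1, 4, 4, 3, 4, 0, 0, 4, 5, 4, 5, 2, 5, 10, 5, 1, 4, 15, 4, 9, 11,
        16, 23, 25, 24, 17, 31, 42, 35, 45, 49, 54, 74, 69, 63, 46, 35, 31,
        27, 15, 10, 5, 10, 4, 2, 4, 2, 2, 3, 5, 2, 2]
cp = one_changepoint(data)
```

Extending this to several changepoints requires a recursive or dynamic-programming search, which is what findchangepts does internally.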
I know this is an old question, but I just want to provide some extra thoughts. In MATLAB, an alternative I implemented is a Bayesian changepoint detection algorithm that estimates not just the number and locations of the changepoints but also reports their probability of occurrence. In its current implementation, it handles only time-series-like data (i.e., 1D sequential data). More info about the tool is available at this FileExchange entry (https://www.mathworks.com/matlabcentral/fileexchange/72515-bayesian-changepoint-detection-time-series-decomposition).
Here is its quick application to your sample data:
% Automatically install the Rbeast or BEAST library to local drive
eval(webread('http://b.link/beast'))
data = [1 4 4 3 4 0 0 4 5 4 5 2 5 10 5 1 4 15 4 9 11 16 23 25 24 17 31 42 35 45 49 54 74 69 63 46 35 31 27 15 10 5 10 4 2 4 2 2 3 5 2 2];
out = beast(data, 'season','none') % season='none': there is no seasonal/periodic variation in the data
printbeast(out)
plotbeast(out)
Below is a summary of the changepoint, given by printbeast():
#####################################################################
# Trend Changepoints #
#####################################################################
.-------------------------------------------------------------------.
| Ascii plot of probability distribution for number of chgpts (ncp) |
.-------------------------------------------------------------------.
|Pr(ncp = 0 )=0.000|* |
|Pr(ncp = 1 )=0.000|* |
|Pr(ncp = 2 )=0.000|* |
|Pr(ncp = 3 )=0.859|*********************************************** |
|Pr(ncp = 4 )=0.133|******** |
|Pr(ncp = 5 )=0.008|* |
|Pr(ncp = 6 )=0.000|* |
|Pr(ncp = 7 )=0.000|* |
|Pr(ncp = 8 )=0.000|* |
|Pr(ncp = 9 )=0.000|* |
|Pr(ncp = 10)=0.000|* |
.-------------------------------------------------------------------.
| Summary for number of Trend ChangePoints (tcp) |
.-------------------------------------------------------------------.
|ncp_max = 10 | MaxTrendKnotNum: A parameter you set |
|ncp_mode = 3 | Pr(ncp= 3)=0.86: There is a 85.9% probability |
| | that the trend component has 3 changepoint(s).|
|ncp_mean = 3.15 | Sum{ncp*Pr(ncp)} for ncp = 0,...,10 |
|ncp_pct10 = 3.00 | 10% percentile for number of changepoints |
|ncp_median = 3.00 | 50% percentile: Median number of changepoints |
|ncp_pct90 = 4.00 | 90% percentile for number of changepoints |
.-------------------------------------------------------------------.
| List of probable trend changepoints ranked by probability of |
| occurrence: Please combine the ncp reported above to determine |
| which changepoints below are practically meaningful |
'-------------------------------------------------------------------'
|tcp# |time (cp) |prob(cpPr) |
|------------------|---------------------------|--------------------|
|1 |33.000000 |1.00000 |
|2 |42.000000 |0.98271 |
|3 |19.000000 |0.69183 |
|4 |26.000000 |0.03950 |
|5 |11.000000 |0.02292 |
.-------------------------------------------------------------------.
Here is the graphic output. Three major changepoints are detected:
You can use the sgolayfilt function, which fits a polynomial to the data, or reproduce the OLS method yourself: http://www.utdallas.edu/~herve/Abdi-LeastSquares06-pretty.pdf (note that it uses a+bx notation instead of ax+b).
For a linear fit of ax+b:
If you replace x with a constant vector of length 2n+1, [-n, ..., 0, ..., n], on each step, you get the following code for the sliding regression coefficients:
n = 5;                        % window half-width (choose to taste)
x = -n:n;                     % centred abscissa, identical for every window
sum_x2 = sum(x.^2);
a = zeros(1, length(y));
b = zeros(1, length(y));
for i = 1+n : length(y)-n
    yi = y(i-n : i+n);
    sum_xy = sum(yi.*x);
    a(i) = sum_xy/sum_x2;     % least-squares slope
    b(i) = sum(yi)/(2*n+1);   % sliding average
end
Notice that in this code b is the sliding average of your data and a is a least-squares slope estimate (the first derivative).

3D or 4D interpolation to find the corresponding values based on 4 columns of variables

I'm trying to find out whether it's possible to interpolate/calculate the corresponding values from this set of variables:
+-------------+-------------+------+------+
| x | y | z | g |
+-------------+-------------+------+------+
| 150.8385804 | 183.7613678 | 0.58 | 2 |
| 171.0745381 | 231.7033081 | 2 | 0.58 |
| 179.1394672 | 244.5019837 | 0.8 | 0.8 |
| 149.1849453 | 180.7103271 | 0.8 | 2 |
| 162.5648017 | 212.8121033 | 2 | 0.8 |
| 141.1687115 | 163.4759979 | 0.8 | 3 |
| 140.7505385 | 162.7905884 | 0.9 | 3 |
| 148.1461022 | 180.5486908 | 1.8 | 1.6 |
| 147.1552106 | 178.7599182 | 2 | 1.6 |
+-------------+-------------+------+------+
What would be the corresponding z and g for x=143 and y=179? I do have access to MATLAB if anyone can suggest code for it.
Here is the MATLAB syntax to load the above data into your workspace:
X = [150.8385804 171.0745381 179.1394672 149.1849453 162.5648017 141.1687115 140.7505385 148.1461022 147.1552106].';
Y = [183.7613678 231.7033081 244.5019837 180.7103271 212.8121033 163.4759979 162.7905884 180.5486908 178.7599182].';
Z = [0.58 2 0.8 0.8 2 0.8 0.9 1.8 2].';
G = [2 0.58 0.8 2 0.8 3 3 1.6 1.6].';
You can use scatteredInterpolant to do this. scatteredInterpolant performs interpolation on a scattered dataset, which is basically what you have, and you can apply it twice: once for z and once for g. You specify x and y as the key / control points with the corresponding z and g output points. scatteredInterpolant creates an object for you; you can then query it at custom x and y values and it will return an interpolated answer. The default interpolation method is linear. As such, you'd query at x=143 and y=179 and read off the interpolated z and g.
In other words:
X = [150.8385804 171.0745381 179.1394672 149.1849453 162.5648017 141.1687115 140.7505385 148.1461022 147.1552106].';
Y = [183.7613678 231.7033081 244.5019837 180.7103271 212.8121033 163.4759979 162.7905884 180.5486908 178.7599182].';
Z = [0.58 2 0.8 0.8 2 0.8 0.9 1.8 2].';
G = [2 0.58 0.8 2 0.8 3 3 1.6 1.6].';
%// Create scatteredInterpolant
Zq = scatteredInterpolant(X, Y, Z);
Gq = scatteredInterpolant(X, Y, G);
%// Figure out interpolated values
zInterp = Zq(143, 179);
gInterp = Gq(143, 179);
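To see what such an interpolation is doing without MATLAB, here is a rough inverse-distance-weighted stand-in in Python with NumPy. It is a different method from scatteredInterpolant's default linear triangulation, but it illustrates the same idea of estimating z and g at a query point from the scattered (x, y) neighbours:

```python
import numpy as np

X = np.array([150.8385804, 171.0745381, 179.1394672, 149.1849453,
              162.5648017, 141.1687115, 140.7505385, 148.1461022, 147.1552106])
Y = np.array([183.7613678, 231.7033081, 244.5019837, 180.7103271,
              212.8121033, 163.4759979, 162.7905884, 180.5486908, 178.7599182])
Z = np.array([0.58, 2, 0.8, 0.8, 2, 0.8, 0.9, 1.8, 2])
G = np.array([2, 0.58, 0.8, 2, 0.8, 3, 3, 1.6, 1.6])

def idw(xq, yq, values, power=2):
    """Inverse-distance-weighted estimate of `values` at (xq, yq)."""
    d = np.hypot(X - xq, Y - yq)
    if np.any(d == 0):               # query coincides with a data point
        return values[np.argmin(d)]
    w = 1.0 / d ** power             # nearer points get larger weights
    return np.sum(w * values) / np.sum(w)

z_est = idw(143, 179, Z)
g_est = idw(143, 179, G)
```

Because the estimate is a weighted average, it always stays within the range of the observed z and g values, whereas linear extrapolation in scatteredInterpolant can leave that range outside the convex hull of the data.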