Creating a binary variable based on the median of another variable, grouping by two variables

In Stata I would like to create a binary variable median_unemp based on the median value of another variable unemp, grouping the calculation of the median value by region and year. That is, median_unemp is 1 when the unemployment for that particular observation is greater than the median unemployment for the region and the year of the observation (and is 0 otherwise).
The code below generates my variable considering the entire dataset, but I want the median to be calculated by subgroups (by region and year):
webuse productivity.dta, clear
summarize unemp, detail
gen median_response = r(p50)
gen median_unemp = (unemp>=median_response)
replace median_unemp =. if unemp==.
On closer inspection of the data, I would like to know whether unemp for observation 1 of my dataset (which is in region=1 and year=1970) is greater than the median of unemp (calculated for region=1 and year=1970), and so on. If it is greater than the median, then median_unemp==1. If it is lower than the median, then median_unemp==0.

webuse productivity.dta, clear
egen median_unemp = median(unemp), by(region year)
gen high_unemp = (unemp >= median_unemp) if unemp < .
In this dataset, there are no missing values for unemp, but handling missings separately is good practice. Each median is the 5th of 9 values, so, setting aside ties, 4 values will be less than the median and 5 greater than or equal to it.
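For readers outside Stata, the same two steps (a median per region and year, then a comparison that leaves missing values missing) can be sketched in Python. This is only an illustration: the rows below are made-up values, not the productivity data, and `None` stands in for Stata's missing value.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows: (region, year, unemp); None marks a missing value.
rows = [
    (1, 1970, 4.0), (1, 1970, 5.0), (1, 1970, 6.0),
    (2, 1970, 3.0), (2, 1970, 7.0), (2, 1970, None),
]

# Collect the non-missing values of each (region, year) group.
groups = defaultdict(list)
for region, year, unemp in rows:
    if unemp is not None:
        groups[(region, year)].append(unemp)

# One median per group, like egen ... median(), by(region year).
group_median = {key: median(vals) for key, vals in groups.items()}

# 1 if at or above the group median, 0 if below, None if unemp is missing.
high_unemp = [
    None if unemp is None else int(unemp >= group_median[(region, year)])
    for region, year, unemp in rows
]
print(high_unemp)
```

The missing observation stays missing instead of being coded as 1, which is what the `if unemp < .` qualifier achieves in the Stata code.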

Related

Esper: Take the last value from each id and take the mean of all but the most extreme

I have 5 temperature sensors. I want to calculate the mean temperature of 4 - excluding the most extreme value (high or low).
Firstly: will std:unique(id) create a window of the last temperature readings for each id 1-5?
select
avg(tempEvent.temp) as meantemp
from
Event(id in (1, 2, 3, 4, 5)).std:unique(id) as tempEvent
Secondly: how could I change the select statement (possibly using an expression if necessary) to only calculate the mean of four values excluding the most extreme?
The background is, I want to know the deviations of each temperature from the average, but I don't want the average to include an anomalous id. Otherwise all temperatures will look like they are deviating from the average but really only one is.
Finding the average of the middle four values is simple enough, though not as elegant as your solution. The code below will work for any number of temps.
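SELECT
AVG(temp) AS meantemp
FROM (
SELECT
temp,
COUNT(*) OVER () AS c, -- total number of readings
ROW_NUMBER() OVER (ORDER BY temp) AS r -- position from lowest (1) to highest (c)
FROM
[table]
) AS ranked
WHERE
r > 1 -- drop the lowest reading
AND r < c -- drop the highest reading
;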
As for your second question, I'm not sure I understand. Do you want the value from among the four middle values that has the greatest absolute deviation from the mean of those four values?
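One possible reading of "excluding the most extreme value" is to drop the single reading that lies farthest from the median and average the rest. The sketch below illustrates that reading in Python. It is not Esper EPL, and both the function name and the choice of the median as reference point are assumptions made for illustration.

```python
from statistics import mean, median

def mean_without_most_extreme(temps):
    """Average the readings after dropping the one farthest from the median."""
    if len(temps) < 2:
        raise ValueError("need at least two readings")
    centre = median(temps)
    # Index of the reading with the largest absolute distance from the median.
    worst = max(range(len(temps)), key=lambda i: abs(temps[i] - centre))
    kept = temps[:worst] + temps[worst + 1:]
    return mean(kept)

# Hypothetical latest readings from five sensors; the fourth is anomalous.
readings = [20.1, 19.8, 20.3, 35.0, 20.0]
print(mean_without_most_extreme(readings))
```

Measuring the distance from the median rather than from the mean keeps the outlier from pulling the reference point towards itself.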

MATLAB: How can I randomly select smaller values with higher probabilities?

I have a column vector "distances", and I want to select a value randomly from this vector such that smaller values have a higher probability of being selected. So far I am using the following, where "possible_cells" is the randomly selected value:
w=(fliplr(1:numel(distances)))/100
possible_cells=randsample((sort(distances)),1,true,w)
Basically, I flipped the index vector to create the selection weights "w" (if I am understanding randsample correctly), so that the smallest value gets the largest weight and the largest value gets the smallest. To check how well this works, I randomly drew 50 values and, using a histogram, I see that the values are higher than I would expect. Does anyone have an idea of how else to do what I described above?
How about something like this?
Let's start with 10 sample distances with lengths no greater than 20, just to demonstrate:
d = randi(20,10,1);
Next, since we want smaller values to be more likely, let's take the reciprocal of those distances:
d_rec = 1./d;
Now, let's normalize so we can create a distribution from which to select our distance:
d_rec_norm = d_rec ./ sum(d_rec);
This new variable reflects the probability with which to select each given distance. Now comes a little trick... we choose the distance like this:
d_i = find(rand < cumsum(d_rec_norm),1);
This will give us the index of our chosen distance. The logic behind this is that when cumulatively summing the normalized values associated with each distance (d_rec_norm) we create "bins" whose widths are proportional to the likelihood of selecting each distance. All that is left is to pick a random number between 0 and 1 (rand) and see which "bin" it falls in.
I'm a new poster here, so let me know if this is unclear and I can try to improve my explanation.
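The cumulative-sum trick is not specific to MATLAB. The following Python sketch follows the same steps (reciprocal weights, normalization, cumulative "bins", one uniform draw). The function name `pick_index` and the sample distances are illustrative, and the sketch assumes all distances are strictly positive.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def pick_index(distances, rng=random):
    """Return an index chosen with probability proportional to 1/distance."""
    weights = [1.0 / d for d in distances]  # smaller distance, larger weight
    total = sum(weights)
    # Upper edges of the bins; the last edge is (up to rounding) 1.
    cumulative = list(accumulate(w / total for w in weights))
    u = rng.random()  # uniform draw in [0, 1)
    # First bin whose upper edge exceeds u; min() guards against rounding at the top edge.
    return min(bisect_right(cumulative, u), len(distances) - 1)

# Draw many times from three hypothetical distances and count the picks.
rng = random.Random(0)
distances = [1.0, 2.0, 4.0]
counts = [0, 0, 0]
for _ in range(10000):
    counts[pick_index(distances, rng)] += 1
print(counts)
```

With these distances the expected shares are 4/7, 2/7 and 1/7, so the counts should decrease from the first entry to the last. Python's standard library also offers `random.choices(population, weights=...)`, which performs a weighted draw directly.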

How to get monthly totals from linearly interpolated data

I am working with a data set of tens of thousands of variables which have been repeatedly measured since the 1980s. The first measurements for each variable are not on the same date, and the variables are irregularly measured: sometimes measurements are only a month apart; in a small number of cases they are decades apart.
I want to get the change in each variable per month.
So far I have a cell array of measurement dates and a cell array of interpolated rates of change between measurements (each cell represents a single variable in either array, and I've only posted the first 5 cells in each array):
DateNumss= {[736614;736641;736669] [736636;736666] 736672 [736631;736659;736685] 736686}
LinearInterpss={[17.7777777777778;20.7142857142857;0] [0.200000000000000;0] 0 [2.57142857142857;2.80769230769231;0]}
How do I get monthly sums of the interpolated change in variable?
i.e.
If the first measurement for a variable is made on January 1st, and the linearly interpolated change between that and the next measurement is 1 per day; and the next measurement is on February 5th and the corresponding linearly interpolated change is 2; then January has a total change of 1*31 (31 days at 1) and February has a total change of 1*5+2*23 (5 days at 1, 23 days at 2).
You would need the points in the serial dates that correspond with the change of a month.
mat(:,1)=sort(repmat(1980:1989,[1,12]));
mat(:,2)=repmat(1:12,[1,size(mat,1)/12]);
mat(:,3)=1;
monthseps=datenum(mat);
This gives you a list of all 120 changes of months in the eighties.
Now you want, for each month, the change per day, and then its sum. If you start from the original data it is easier, since you can just interpolate each day's value using MATLAB. If you only have "LinearInterpss", you need to map it onto the days using interp1 with the method 'previous'.
k = 1; % index of the variable (cell) to process
monthly_total = zeros(length(monthseps)-1, 1);
for ct = 2:length(monthseps)
days = monthseps(ct-1):(monthseps(ct)-1); % days in the month
% Assign each day its rate of change. interp1 with method 'previous' returns the most recent value of LinearInterpss at or before each day (NaN outside the measured range).
vals = interp1(DateNumss{k}, LinearInterpss{k}, days, 'previous');
monthly_total(ct-1) = sum(vals, 'omitnan'); % the sum of the daily changes is the total change in the month
end
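The same "previous value" lookup followed by a per-month sum can be sketched in Python with the standard library only. The function name, the dates and the rates below are made up for illustration. The sketch assumes that a rate applies from its measurement day up to the day before the next measurement.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date, timedelta

def monthly_totals(measure_dates, daily_rates, end):
    """Sum a piecewise-constant daily rate into (year, month) totals.

    daily_rates[i] applies from measure_dates[i] up to the day before
    measure_dates[i + 1]; the last rate applies up to (not including) `end`.
    """
    totals = defaultdict(float)
    day = measure_dates[0]
    while day < end:
        # Most recent measurement on or before this day ("previous" lookup).
        i = bisect_right(measure_dates, day) - 1
        totals[(day.year, day.month)] += daily_rates[i]
        day += timedelta(days=1)
    return dict(totals)

# Hypothetical variable: rate 1 from 1 January 2021, rate 2 from 5 February 2021.
totals = monthly_totals(
    [date(2021, 1, 1), date(2021, 2, 5)], [1.0, 2.0], date(2021, 3, 1)
)
print(totals)
```

Under that convention February 2021 gets 4 days at the old rate and 24 days at the new one. Counting the measurement day differently shifts the split by one day, so it is worth fixing the convention before comparing against a hand calculation.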

Stata: Keep only observations with minimum, maximum and median value of a given variable

In Stata, I have a dataset with two variables: id and var, and say 1000 observations. The variable var is of type float and takes distinct values for all observations. I would like to keep only the three observations where var is either the minimum of var, the maximum of var, or the median of var.
The way I currently do this:
summarize var, detail
local varmax = r(max)
local varmin = r(min)
local varmedian= r(p50)
keep if inlist(float(var),float(`varmax') , float(`varmedian'), float(`varmin'))
The problem that I face is that sometimes the inlist condition will not match one of the values. E.g. I end up with two observations instead of three, for instance the one with min and the one with max, but not the one with median. I suspect this has to do with a precision problem. As you see, I tried to convert all numbers to float, but this is apparently not sufficient.
Any fix to my solution, or alternative solution would be greatly appreciated (if possible without installing additional packages), thanks!
This is not in the first instance a precision problem.
It is an inevitable problem when (1) the number of values is even and (2) the median is the mean of two central values that are different. Then the median itself is not a value in the dataset and will not be found by keep.
Consider a data set 1, 2, 3, 4. The median 2.5 is not in the data. This is very common; indeed it is what is expected with all values distinct and the number of observations even.
Other problems can arise because two or even three of the minimum, median and maximum could be equal to each other. This is not your present problem, but it can bite with other variables (e.g. indicator variables).
Precision problems are possible.
Here is a general solution that avoids all these difficulties.
If you collapse to min, median, max and then reshape, you can avoid the problem. You will always get three results, even if they are numerically equal and/or not present in the data.
In the trivial example below, the identifier is needed only to appease reshape. In other problems, you might want to collapse using by() and then your identifier comes ready-made. However, you will be less likely to want to reshape in that case.
. clear
. set obs 4
number of observations (_N) was 0, now 4
. gen y = _n
. collapse (min)ymin=y (max)ymax=y (median)ymedian=y
. gen id = _n
. reshape long y, i(id) j(statistic) string
(note: j = max median min)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 1 -> 3
Number of variables 4 -> 3
j variable (3 values) -> statistic
xij variables:
ymax ymedian ymin -> y
-----------------------------------------------------------------------------
. list
+---------------------+
| id statis~c y |
|---------------------|
1. | 1 max 4 |
2. | 1 median 2.5 |
3. | 1 min 1 |
+---------------------+
All that said, having (lots of?) datasets with just three observations sounds like a poor data management strategy. Perhaps this is extracted from some larger question.
UPDATE
Here is another way to keep precisely 3 observations. Apart from the minimum and maximum, we use the rule that we keep the "low median", i.e. the lower of two values averaged for the median, when the number of observations is even, and a single value that is the median otherwise. (In Stephen Stigler's agreeable terminology, we can talk of "comedians" in the first case.)
. sysuse auto, clear
(1978 Automobile Data)
. sort mpg
. drop if missing(mpg)
(0 observations deleted)
. keep if inlist(_n, 1, cond(mod(_N, 2), ceil(_N/2), floor(_N/2)), _N)
(71 observations deleted)
. l mpg
+-----+
| mpg |
|-----|
1. | 12 |
2. | 20 |
3. | 41 |
+-----+
mod(_N, 2) is 1 if _N is odd and 0 if _N is even. The expression in cond() selects ceil(_N/2) if the number of observations is odd and floor(_N/2) if it is even.
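The position rule can be checked outside Stata. The Python sketch below mirrors the `cond()` expression with 1-based positions. The function names are illustrative, and the values are assumed to be non-missing.

```python
import math

def low_median_position(n):
    """1-based position of the low median among n sorted values."""
    # Mirrors cond(mod(_N, 2), ceil(_N/2), floor(_N/2)).
    return math.ceil(n / 2) if n % 2 else n // 2

def keep_min_lowmedian_max(values):
    """Return the minimum, low median and maximum of the values, in order."""
    ordered = sorted(values)
    n = len(ordered)
    # A set, so coinciding positions (very small n) collapse to one entry.
    positions = sorted({1, low_median_position(n), n})
    return [ordered[p - 1] for p in positions]

print(low_median_position(74))               # the auto data has 74 observations
print(keep_min_lowmedian_max([4, 1, 3, 2]))  # even count, so the low median is kept
```

For the 74 observations of the auto data the rule picks position 37, and for the values 1, 2, 3, 4 it keeps 2 rather than the mean 2.5, so every kept value is one that occurs in the data. Python's `statistics.median_low` applies the same low-median convention.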

calculate monthly mean, 90th and 99th percentile of time series

I'm reading this article on wind speed trends and they specify in their methods that they tried to determine if there is a trend within the time series of monthly mean, 90th, and 99th percentile values of wind speed over the period shown. How would one achieve this? Furthermore, what does it mean by 90th and 99th percentile? My example:
v = datenum(1981, 1, 1):datenum(2010, 11, 31); % time vector
d = rand(1,length(v)); % data vector
% calculate mean, 90th and 99th percentile values
dateV = datevec(v); % date vector
[~,~,b] = unique(dateV(:,1:2),'rows');
monthly_v = accumarray(b,v,[],@mean);
monthly_d = accumarray(b,d,[],@mean);
I can calculate the monthly mean by the method shown above, but am not sure on how to calculate the 90th and 99th percentile (plus I'm not even sure what it is). Can anyone provide some information on this?
Use the prctile function. What you are seeking is a threshold where the proportion / percentage of input data that is exceeding this threshold is 100% - percentile. For example, if you sought the 90% quantile, you are trying to find a quantity in your input data where 10% of your data exceeded this quantity. For the 99% percentile, you are seeking the quantity in your input data where 1% of your data exceeded this threshold. You can simply call prctile by:
Y = prctile(X, P);
X is your data stored in vector form, and P is a vector or single number that lists the percentiles you desire. The output would be those thresholds that we just talked about, stored in Y.
In your case, v and d are the data for which you want percentiles per month, and thus you would modify your accumarray call like so:
monthly_v_90 = accumarray(b,v,[],@(x) prctile(x, 90));
monthly_v_99 = accumarray(b,v,[],@(x) prctile(x, 99));
monthly_d_90 = accumarray(b,d,[],@(x) prctile(x, 90));
monthly_d_99 = accumarray(b,d,[],@(x) prctile(x, 99));
What the above code will do is that for each unique month, you will calculate the 90% and 99% quantiles for v and d respectively. Specifically, monthly_v_90 and monthly_v_99 will give you the 90% and 99% quantiles for each month in a unique year for v while monthly_d_90 and monthly_d_99 will give you the 90% and 99% quantiles for each month in a unique year for d.
In your call to datevec, you are generating months from January 1981 to December 2010. Because there are 30 years in between, and there are 12 months in a year, you should have 360 element vectors with the above (as well as your calculations for the mean).
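To make the idea of a per-month percentile concrete, here is a small Python sketch that groups daily values by year and month and takes a percentile of each group. The `percentile` helper interpolates linearly between sorted values. That interpolation rule is an assumption chosen for simplicity and is not claimed to reproduce prctile's results exactly. The data are made up.

```python
from collections import defaultdict
from datetime import date, timedelta

def percentile(values, p):
    """p-th percentile (0-100) by linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * p / 100.0  # fractional 0-based position
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def monthly_percentiles(days, data, p):
    """Group data by (year, month) of each day and take the p-th percentile."""
    groups = defaultdict(list)
    for day, value in zip(days, data):
        groups[(day.year, day.month)].append(value)
    return {key: percentile(vals, p) for key, vals in groups.items()}

# Hypothetical daily series covering January and February 1981.
days = [date(1981, 1, 1) + timedelta(days=i) for i in range(59)]
data = list(range(59))
p90 = monthly_percentiles(days, data, 90)
print(p90)
```

Each month yields one threshold: roughly 10% of that month's values lie above the 90th percentile, and roughly 1% lie above the 99th.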