
Rsq-Adjusted
In the R-squared Adjusted metric, R-squared is “adjusted” or modified according to:
1. Number of predictors
2. Sample size
3. Number of missing values in data
4. All of the above

Answer: options 1 and 2, the number of predictors and the sample size.
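For reference, this is the standard adjusted R-squared formula (not stated in the question itself), with $n$ the sample size and $p$ the number of predictors:
$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}$$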

Related

Creating a binary variable based on the median of another variable, grouping by two variables

In Stata I would like to create a binary variable median_unemp based on the median value of another variable unemp, with the calculation of the median grouped by region and year. That is, median_unemp is 1 when the unemployment for that particular observation is greater than the median unemployment for the region and year of the observation (and is 0 otherwise).
The code below generates my variable considering the entire dataset, but I want the median to be calculated by subgroups (by region and year):
webuse productivity.dta, clear
summarize unemp, detail
gen median_response = r(p50)
gen median_unemp = (unemp>=median_response)
replace median_unemp =. if unemp==.
On closer inspection of the data, I would like to know if unemp for observation 1 of my dataset (that is, in region==1 and year==1970) is greater than the value of median_unemp (calculated for region==1 and year==1970), and so on. If it is greater than the median, then median_unemp==1. If it is lower than the median, then median_unemp==0.
webuse productivity.dta, clear
egen median_unemp = median(unemp), by(region year)
gen high_unemp = (unemp >= median_unemp) if unemp < .
In this dataset there are no missing values for unemp, but separating out missings is good practice. Each median is the 5th of 9 values, so, setting ties aside, 4 values will be below the median and 5 will be greater than or equal to it.

MSE in neuralnet results and ROC curve of the results

Hi, my question is a bit long; please bear with me and read it to the end.
I am working on a project with 30 participants. We have two datasets (the first has 30 rows and 160 columns; the second has the same 30 rows and 200 columns as outputs, y, and these outputs are independent). What I want to do is use the first dataset to predict the outputs in the second. As the first dataset was wide and high-dimensional, I used factor analysis and now have 19 factors that cover up to 98% of the variance. Now I want to use these 19 factors to predict the outputs of the second dataset.
I am using neuralnet with backpropagation, and everything goes well; my results are really close to the outputs.
My questions:
1. As my inputs are the factors (they are between -1 and 1) and my outputs are integers between 4 and 10000, should I still scale them before running the neural network?
2. I scaled the data (both inputs and outputs) and then predicted with neuralnet; when I check the MSE it is very high, around 6000, while my predictions and the real outputs are very close to each other. But if I rescale the predictions and outputs and then check the MSE, it is near zero. Is it unbiased to rescale and then check the MSE?
3. I read that it is better not to scale the outputs from the beginning, but if I scale only the inputs, all my predictions are 1. Is it correct not to scale the outputs?
4. If I want to plot the ROC curve, how can I do it, given that my results are never exactly equal to the real outputs?
Thank you for reading my question
[edit #1]: There is a publication on how to produce ROC curves using neural network results:
http://www.lcc.uma.es/~jja/recidiva/048.pdf
1) You can scale your values (using min-max scaling, for example), but only scale your training data set, and save the parameters used in the scaling process (for min-max, the min and max values by which the data are scaled). Only then can you scale your test data set WITH the min and max values you got from the training data set. Remember, with the test data set you are trying to mimic the process of classifying unseen data, and unseen data is scaled with the scaling parameters from the training data set.
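A minimal MATLAB sketch of that discipline (the variable names and toy values are illustrative, not from the question):
ytrain = [4 250 90 7300 10000];        % toy training targets
ytest  = [12 9000];                    % toy "unseen" targets
mn = min(ytrain);                      % scaling parameters are fitted
mx = max(ytrain);                      % on the TRAINING set only
scale = @(y) (y - mn) ./ (mx - mn);    % min-max scaling to [0,1]
ytrain_s = scale(ytrain);
ytest_s  = scale(ytest);               % same parameters; test values may land outside [0,1]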
2) When talking about errors, mention which data set the error was computed on. You can compute an error function (there are several; one of them is the mean squared error, or MSE) on the training data set, and another on your test data set.
4) Think about this: let's say you train a network with the training data set, and it has only 1 neuron in the output layer. Then you present it with the test data set. Depending on which transfer function (activation function) you use in the output layer, you will get a value for each exemplar. Let's assume you use a sigmoid transfer function, whose max and min values are 1 and 0. That means the predictions will be limited to values between 0 and 1.
Let's also say that your target labels ("truth") contain only the discrete values 0 and 1 (indicating which class each exemplar belongs to).
targetLabels=[0 1 0 0 0 1 0 ];
NNprediction=[0.2 0.8 0.1 0.3 0.4 0.7 0.2];
How do you interpret this?
You can apply a hard-limiting function so that the NNprediction vector only contains the discrete values 0 and 1. Let's say you use a threshold of 0.5:
NNprediction_thresh_0.5 = [0 1 0 0 0 1 0];
vs.
targetLabels =[0 1 0 0 0 1 0];
With this information you can compute your false positives (FP), false negatives (FN), true positives (TP), and true negatives (TN), plus a number of derived metrics such as the true positive rate, TPR = TP/(TP+FN).
On a ROC curve, which plots the true positive rate against the false positive rate, this would be a single point. However, if you vary the threshold in the hard-limit function, you get all the values you need for a complete curve.
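A minimal MATLAB sketch of that threshold sweep, reusing the example vectors above (the threshold grid is an arbitrary choice):
targetLabels = [0 1 0 0 0 1 0];
NNprediction = [0.2 0.8 0.1 0.3 0.4 0.7 0.2];
thresholds = 0:0.05:1;                      % sweep the hard-limit threshold
TPR = zeros(size(thresholds));
FPR = zeros(size(thresholds));
for i = 1:numel(thresholds)
    pred = NNprediction >= thresholds(i);   % hard-limit at this threshold
    TP = sum(pred & targetLabels == 1);
    FP = sum(pred & targetLabels == 0);
    FN = sum(~pred & targetLabels == 1);
    TN = sum(~pred & targetLabels == 0);
    TPR(i) = TP / (TP + FN);                % true positive rate
    FPR(i) = FP / (FP + TN);                % false positive rate
end
plot(FPR, TPR, '-o'), xlabel('FPR'), ylabel('TPR')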
Makes sense? See the dependencies of one process on the others?

Measure the STD of RMSE

I'm working on a time series forecasting problem and I would like to confirm if it makes sense to compute the standard deviation of the root mean squared error. If so, is this the correct way?
STD_test = std(sqrt((y_real-y_pred).^2))
Also, imagine that the output of the model is 100, the RMSE 20, and the STD 10. Does this mean that the real value is between [70,120]?
The term y_real-y_pred is the vector of errors. The expression squares each element of it and then takes the square root of each element, which has the effect of abs(). Then std() is run on that vector, so this computes the s.d. of the absolute errors. That is a meaningful metric, but unlikely to be what you are after. Try:
e = y_real-y_pred;    % vector of errors
MSE = mean(e.^2);     % mean squared error
RMSE = sqrt(MSE);     % root mean squared error (a scalar)
sd = std(RMSE);       % std of a scalar, which is zero
That will compute what you want. However, since RMSE is a scalar value, the value sd will be zero, so to answer the first part of your question, no it is not meaningful. What is meaningful is to look at the s.d. of the error itself:
sd = std(e);
RMSE and s.d. are somewhat related but they are distinct.
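One way to see how they are related: the MSE decomposes into the squared mean error plus the population variance of the errors, so the RMSE equals std(e) only when the errors have zero mean. A quick MATLAB check with toy values (illustrative, not from the question):
y_real = [10 12 9 14];            % toy values
y_pred = [11 10 9 13];
e = y_real - y_pred;              % errors: [-1 2 0 1], mean 0.5
mse = mean(e.^2);                 % 1.5
check = mean(e)^2 + var(e, 1);    % 0.25 + 1.25 = 1.5; var(e,1) normalizes by N
% mse equals check up to floating point, so RMSE == std(e) only for
% zero-mean (unbiased) errors.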
Your RMSE is fine, but the final conclusion is not. If the errors are roughly normally distributed, a std of 10 means there is roughly a 68% chance that the real value lies within one std of the prediction. You can refer to the Wikipedia article on the 68-95-99.7 rule to learn more.

Matlab: how to find fundamental frequency from list of energy peaks

In a spectrogram, I have a set of harmonic frequencies (peaks in the spectrum) for a given time frame:
5215
3008.1
2428.1
2214.9
1630.2
1315
997.01
881.39
779.04
667.47
554.21
445.77
336.39
237.69
124.6
If I do -diff(ans), I get the differences between the formants, which hints that the fundamental frequency f_0 of this frame is around 110 Hz:
2206.9
580.06
213.11
584.72
315.24
317.97
115.62
102.35
111.57
113.26
108.44
109.38
98.705
113.08
It is clear that the last 9 values of the first list are harmonics of the same f_0, because the last 8 values of the second list are around the same value. Their mean is 109.05 (but I'm not sure if that is the correct f_0). How can I calculate f_0 in a neat function?
I found an answer myself: I take the two peaks with the lowest frequency values whose energy is above a certain threshold and compute their difference. Then I check whether that difference appears (within a certain tolerance) in the list of frequencies. A sketch of this is below.
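A minimal MATLAB sketch of that idea (the peak values are taken from the question; the energy-threshold step is omitted and the tolerance tol is an assumption):
peaks = sort([124.6 237.69 336.39 445.77 554.21 667.47 779.04 881.39 997.01]);
f0_candidate = peaks(2) - peaks(1);       % difference of the two lowest peaks (~113.1 Hz)
tol = 15;                                 % assumed tolerance in Hz
if any(abs(peaks - f0_candidate) < tol)   % is the difference itself near a peak?
    f0 = f0_candidate
end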

Matlab, smaller duration histogram

I have this histogram plot. It shows a histogram with a bin width of 100 in duration. I want to show the histogram with smaller bins, for example every 10. How can I do this in Matlab? Thanks.
Use
hist(data,nbins)
to specify the number of bins. The default is 10 bins, so if the default splits your data into bins of width 100, ask for ten times as many bins to get width 10:
hist(data,100)
In addition to the answer by @slezadav, if you want to set a given bin width (10 in your example) you can use something like
hist(data,5:10:995)
Using a vector as the second argument of hist specifies bin centers.
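For example, if the data span 0 to 1000, those centers give 100 bins of width 10 (the data here are illustrative):
data = 1000*rand(1,5000);    % illustrative data spanning 0 to 1000
hist(data, 5:10:995)         % 100 width-10 bins centered at 5, 15, ..., 995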
As explained in the docs, use the nbins argument of the hist function:
rng(0,'twister')
data = randn(1000,1);
figure
nbins = 5;
hist(data,nbins)
You can check this by changing the nbins parameter.
See also here: http://www.mathworks.de/de/help/matlab/ref/hist.html