interpret results linear regression matlab

interpret results linear regression matlab - matlab

I am trying to fit a model having as predictor the variables TNST and Seff and as response the variable AUCMET.
The result of the fitting is:
mdl1 =
Linear regression model:
AUCMET ~ 1 + TNST + Seff
Estimated Coefficients:
Estimate SE tStat pValue
(Intercept) 1251.5 72.176 17.34 1.4406e-58
TNST -2.3058 0.16045 -14.371 1.9579e-42
Seff 13.087 1.0748 12.176 9.4907e-32
Number of observations: 932, Error degrees of freedom: 929
Root Mean Squared Error: 322
R-squared: 0.197, Adjusted R-Squared 0.195
F-statistic vs. constant model: 114, p-value = 5.36e-45
The result from the anova analisis is
anova(mdl1)
ans =
SumSq DF MeanSq F pValue
TNST 2.1395e+07 1 2.1395e+07 206.52 1.9579e-42
Seff 1.5359e+07 1 1.5359e+07 148.25 9.4907e-32
Error 9.6243e+07 929 1.036e+05
The output of the diagnostic plot is
plotDiagnostics(mdl)
Could you help me to interpret this result? I see that all the p are < 0.05 so they variables are important for the model.
Is it a good model? what should I look at to understand it?

The r-squared / adjusted r-squared are the Pearson correlation coefficient. https://en.m.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
A 1 is good a 0 is bad so I'd say it's a poetry bad model.

Edit: Now that you edited the question with new information:
1- From the plot diagnostic test it can be seen that there are a percentage of points with high leverage. But this plot does not reveal whether the high-leverage points are outliers. Try plotDiagnostics(mdl,'cookd') to find the outliers (points with large Cook's distance) and remove them from the data.
2- The ANOVA table shows that both variables are important and you cannot consider removing them.
Is a Low R-squared Bad?
No. In fields such as predicting human behavior (e.g. psychology), R-squared values are low because the human's behavior are hard to predict.
Also, if the obtained R-squared is low but the prediction is good, the model counts as a good model. So a low R-squared doesn't necessarily affect the interpretation of significant variables. How high should the R-squared be for prediction? Well, that depends on your requirements for the width of a prediction interval and how much variability is present in your data. While a high R-squared is required for precise predictions, it’s not sufficient by itself, as we shall see. On the other hand, High R-squared Values are not Inherently Good. A high R-squared does not necessarily indicate that the model has a good fit. (read more)
What to do next?
To examine the quality of the model you can perform other tests, such as
ANOVA
To examine the quality of the fitted model, consult an ANOVA table.
tbl = anova(mdl)
Diagnostic plots
Diagnostic plots help you identify outliers, and see other problems in your model or fit.
plotDiagnostics(mdl)
Residuals
There are several residual plots to help you discover errors, outliers, or correlations in the model or data. The simplest residual plots are the default histogram plot, which shows the range of the residuals and their frequencies, and the probability plot, which shows how the distribution of the residuals compares to a normal distribution with matched variance.
plotResiduals(mdl)
And more

Related

How to transform simulated innovations to financial returns with GJR GARCH model?

I am investigating the time varying dependence between financial return series using copula theory.
For each marginal time series I have fitted a GJR GARCH model with t-distributed innovations and extracted the standardized residuals.
These residuals I use as input for my copula models.
Now, with the estimated copula models I have simulated 1000 standardized residuals for each point t.
I have already transformed these simulated uniformly distributed residuals back to their t-distribution.
I am wondering how I can transform these simulated t-distributed residuals back to returns with the GJR GARCH model for each point in time.
Thank you in advance!

Function approximation by ANN

So I have something like this,
y=l3*[sin(theta1)*cos(theta2)*cos(theta3)+cos(theta1)*sin(theta2)*cos(theta3)-sin(theta1)*sin(theta2)*sin(theta3)+cos(theta1)*cos(theta2)sin(theta3)]+l2[sin(theta1)*cos(theta2)+cos(theta1)*sin(theta2)]+l1*sin(theta1)+l0;
and something similar for x. Where thetai is angles from specified interval and li some coeficients. Task is approximate inversion of equation, so you set x and y and result will be appropriate theta. So I random generate thetas from specified intervals, compute x and y. Then I norm x and y between <-1,1> and thetas between <0,1>. This data I used as training set in such way, inputs of network are normalized x and y, outputs are normalized thetas.
I train the network, tried different configuration and absolute error of network was still around 24.9% after whole night of training. It's so much, so I don't know what to do.
Bigger training set?
Bigger network?
Experiment with learning rate?
Longer training?
Technical info
As training algorithm was used error back propagation. Neurons have sigmoid activation function, units are biased. I tried topology: [2 50 3], [2 100 50 3], training set has length 1000 and training duration was 1000 cycle(in one cycle I go through all dataset). Learning rate has value 0.2.
Error of approximation was computed as
sum of abs(desired_output - reached_output)/dataset_lenght.
Used optimizer is stochastic gradient descent.
Loss function,
1/2 (desired-reached)^2
Network was realized in my Matlab template for NN. I know that is weak point, but I'm sure my template is right because(successful solution of XOR problem, approximation of differential equations, approximation of state regulator). But I show this template, because this information may be useful.
Neuron class
Network class
EDIT:
I used 2500 unique data within theta ranges.
theta1<0, 180>, theta2<-130, 130>, theta3<-150, 150>
I also experiment with larger dataset, but accuracy doesn't improve.

How can I validate my estimated covariance matrix?

I have two covariances of size 6*6, one is supposed be the true covariance and the other is the Maximum likelihood estimate for my covariance. Is there any way I could validate my estimated covariance?

I don't know how exactly you determined your covariance matrix, but generally it is a good first step to check the confidence intervals of your estimators.
Heuristically speaking a wide confidence interval suggests that your estimator has a lot of uncertainty.
Take a look at the Matlab function corrcoef, which also gives lower and upper bounds for the estimated correlation coefficients,
cf. https://uk.mathworks.com/help/matlab/ref/corrcoef.html#bunkanr .
Maybe using this function on your data gives you a good starting point. If you use your own function to estimate the ML estimators, you will have to add the confidence intervals yourself.

How many iterations should you make for the simulation to be a 'Monte Carlo simulation' for BER calculations?

Edited question
How many iterations should you make for the simulation to be an accurate 'Monte Carlo simulation' for Bit error rate calculations?
What is the minimum value? If I want to repeat the simulation by an exponentially growing number for five times? should I start from 1e2 thus>> iterations = [1e2 1e3 1e4 1e5 1e6] or 1e3 >> [1e3 1e4 1e5 1e6 1e7]? or something else? what is the common practice?
Additional info:
I used [8e3 1e4 3e4 5e4 8e4 1e5] before but that is not enough according to the prof. because the result is not satisfactory.
Simulations take a very long time on my computer so I cannot keep changing the iterations based on the result. If there is a common practice about this, please let me know.
Thanks #BillBokeey for helping me edit the question.

What your professor propose strikes me as qualitative, but not quantitative way to estimate the convergence of your simulation.
Frankly, I don't know how BER is computed, but I deal a lot with some integral calculations by MC.
In such case you sample xi over some interval and compute
fMC = Si fi / N, where S denotes summation. We know that fMC will converge to true value with variance of sigma2/N (or std.deviation of sigma/sqrt(N)). What do we do then, we compute in the same simulation estimation of sigma, assume for large enough N to be good approximation of sigma and get simulation error plotted. IN practical terms alongside with fMC we compute second momentum sum and average as f2MC = Si f2i / N, and at the end get s=sqrt(f2MC - (fMC)2)/sqrt(N) as estimated error of the MC simulation (it will be a bit biased though).
Thus you could plot on the same graph value of BER and statistical error of the simulation. You could even do better - ask user to input required statistical error (say, in %, meaning user enters s/f*100), and continue simulation in bunches till you reach required precision.
THen you could judge if 109 points are enough or not...

Assuming that we denote our simulated BER as Pb_hat and that Pb_hat in [(1 - alpha)Pb, (1 + alpha)Pb], where Pb is the true BER, and alpha is the percent deviation tolerance (e.g., 0.1), then from [van Trees 2013, pg. 83] we know that the number of Monte Carlo trials required to obtain Pb_hat with a confidence probability pc is K=(c / alpha)^2 x (1-Pb) / Pb,
with c given in Table I.
Table I: confidence interval probabilities from the Gaussian distribution
pc
0.900
0.950
0.954
0.990
0.997
c
1.645
1.960
2.000
2.576
3.000
Example: Suppose we want to simulate a BER of 10^-4 with a percent deviation tolerance of 0.01 and a confidence probability 0.950, then from Table I we know that c = 1.960 and by applying the formula K = (1.96/0.01)^2 x (1-10^-4)/10^-4 = 384121584 Monte Carlo trials. This is a surprisingly large value, though.
As a rule of thumb, K should be on the order of 1O/BER [Jeruchim 1984]
[van Trees 2013] H. L. van Trees, K. L. Bell, and Z. Tian, Detection, estimation, and filtering theory, 2nd ed., Hoboken, NJ: Wiley, 2013.
[Jeruchim 1984] M. Jeruchim, "Techniques for Estimating the Bit Error Rate in the Simulation of Digital Communication Systems," in IEEE Journal on Selected Areas in Communications, vol. 2, no. 1, pp. 153-170, January 1984, doi: 10.1109/JSAC.1984.1146031.

FFT: Match samples to frequency

let us assume,
I have a vector t with the times in seconds of my samples. (These samples are not equally distributed on the time domain.
Also I have a vector data containing the samplevalues at the time t.
t and data have the same length.
If I plot the graph some sort of periodical signal is obtained.
now I could perform: abs(fft(data)) to get my spectrum, which is then plotted over the amount of data points on the x-axis.
How can I obtain my spectrum regarding the times in vector t and plot it?
I want to see which frequencies in 1/s or which period in s my signal contains.
Thanks for your help.

[Not the OP's intention]: FFT will give you the spectrum (global) for any number of input data points. You cannot have a specific data point (in time) associated with parts (or the full) spectrum.
What you can do instead is use spectrogram and obtain the Short-Time Fourier Transform (STFT). This will give you a NxM discrete grid of time-frequency FT values (N: FT frequency bins, M: signal time-windows).
By localizing the (overlapping) STFT windows on your data samples of interest you will get N frequency magnitude values, thus the distribution of short-term spectrum estimates as the signal changes in time.
See also the possibly relevant answer here: https://stackoverflow.com/a/12085728/651951
EDIT/UPDATE:
For unevenly spaced data you need to consider the Non-Uniform DFT (and Non-uniform FFT implementations). See the relevant question/answer here https://scicomp.stackexchange.com/q/593
The primary approaches for NFFT or NUFFT, are based on creating a uniform grid through local convolutions/interpolation, running FFT on this and undoing the convolutional effect of the interpolation filter.
You can read more:
A. Dutt and V. Rokhlin, Fast Fourier transforms for nonequispaced data, SIAM J. Sci. Comput., 14, 1993.
L. Greengard and J.-Y. Lee, Accelerating the Nonuniform Fast Fourier Transform, SIAM Review, 46 (3), 2004.
Pippig, M. und Potts, D., Particle Simulation Based on Nonequispaced Fast Fourier Transforms, in: Fast Methods for Long-Range Interactions in Complex Systems, 2011.
For an implementation (with an interface to MATLAB) try NFFT and possibly its parallelized version PNFFT. You may find a nice walk-through on how to set-up and use here.

You can resample or interpolate your sample points to get another set of sample points that are equally spaced in t. The chosen spacing or sample rate of the second set of equally spaced sample points will allow you to infer frequencies to the result of an FFT of that second set.
The results may be noisy or include aliasing unless the initial data set is bandlimited to a sufficiently low frequency to allow interpolation. If bandlimited, then you might try something like cubic splines as an interpolation method.
Although it may look like one can get a high FFT bin frequency resolution by resampling to a larger number of data points, the actual useful resolution accuracy will be more related to the original number of samples.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse