Why is it important to transform the data into a normal / Gaussian distribution when creating a linear regression model? - linear-regression

I'm currently building my first regression model, and as we know, owing to the limitations of the algorithm, we need to remove outliers and transform the distribution into a normal one.
I know that it's important and how to do it, but can someone please help me understand why exactly we need to do so? Why can't I work with a highly skewed distribution? Why does linear regression mandate this transformation in the preprocessing stage?

(Figure: classifier with and without outliers in the data.)
I hope the picture above clears up your doubt.
Since a linear regression model is fitted by minimizing the squared error, outliers (abnormal data points or noise) pull the fit toward themselves, so the model may deviate from the bulk of the data and perform poorly on the test data (more general data).
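To see this concretely, here is a minimal sketch (with made-up data) of how a single extreme point shifts an ordinary least-squares fit:

```python
# Minimal illustration (made-up data): one outlier pulls the
# least-squares line away from the trend the other points follow.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + rng.normal(scale=1.0, size=20)

clean_slope = LinearRegression().fit(X, y).coef_[0]

y_out = y.copy()
y_out[-1] += 100.0  # inject a single extreme point
outlier_slope = LinearRegression().fit(X, y_out).coef_[0]

print(f"slope without outlier: {clean_slope:.2f}")   # roughly 2
print(f"slope with outlier:    {outlier_slope:.2f}")  # pulled well away from 2
```

Because the loss is squared, a single extreme point can dominate the objective, which is why removing or down-weighting outliers matters more for least squares than for more robust losses.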

Related

Poorly calibrated probabilities but good classification in confusion matrix

I have an imbalanced data set. My goal is to balance sensitivity and specificity via the confusion matrix. I used glmnet in R with class weights. The model does well at balancing sensitivity/specificity, but when I looked at the calibration plot, the probabilities were not well calibrated. I have read about calibrating probabilities, but I am wondering whether it matters if my goal is to produce class predictions. If it does matter, I have not found a way to calibrate the probabilities when using caret::train().
This topic has been widely discussed, especially in some answers by Stephan Kolassa. I will try to summarize the main take-home messages for your specific question.
From a purely statistical point of view, your interest should be in producing, as output, a probability for each class for any new data instance. Since you are dealing with unbalanced data, such probabilities can be small, which - as long as they are correct - is not an issue. Of course, some models can give you poor estimates of the class probabilities. In such cases, calibration allows you to correct the probabilities obtained from a given model, so that whenever you estimate a probability p that a new observation belongs to the target class, p is indeed its true probability of being of that class.
If you are able to obtain a good probability estimator, then balancing sensitivity and specificity is not part of the statistical side of your problem, but rather of the decision component. The final decision will likely need to use some kind of threshold. Depending on the costs of type I and type II errors, the cost-optimal threshold might change; an optimal decision might also involve more than one threshold.
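As a small illustration of that decision step, here is a minimal sketch of a cost-based threshold (the probabilities and costs are made-up placeholders, and it assumes the probabilities are already well calibrated):

```python
import numpy as np

# Calibrated probabilities of the positive class (placeholder values).
p = np.array([0.02, 0.10, 0.35, 0.60, 0.90])

# Assumed, made-up costs: a false negative hurts 10x more than a false positive.
cost_fp = 1.0
cost_fn = 10.0

# Classify as positive when the expected cost of predicting "negative"
# exceeds the expected cost of predicting "positive":
#   p * cost_fn > (1 - p) * cost_fp  =>  p > cost_fp / (cost_fp + cost_fn)
threshold = cost_fp / (cost_fp + cost_fn)
predictions = (p > threshold).astype(int)

print(threshold)    # ~0.09 with these costs
print(predictions)  # [0 1 1 1 1]
```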
Ultimately, you really have to be careful about what the specific needs of the end user of your model are, because that is what will determine the best way of making decisions with it.

Advice on Speeding up SciPy Custom Distribution Sampling & Fitting

I am trying to fit a custom distribution to a large (~O(500,000) measurements) dataset using scipy. I have derived a theoretical PDF based on some other factors, but I cannot find an exact form of the CDF, either by hand or using symbolic integration software.
Currently, simply drawing 1000 random samples from my custom distribution is expensive, which I believe is due to the need to numerically invert an unknown CDF. If I cannot find an explicit form of the CDF and its inverse, is there anything else I can do to speed up usage of this distribution?
I've used Maple, MATLAB, and SymPy to try to determine a CDF, yet none gives a result. I also tried down-sampling my data while still retaining the tail attributes, but this still required so much data that doing anything with the distribution was slow.
My distribution is a sub-class of SciPy's rv_continuous class.
Thanks for any advice.
This sounds like you want to sample from a kernel density estimate of the probability distribution. While SciPy does offer a Gaussian kernel density package, for that many measurements you would be much better off using sklearn's implementation. A good resource with code examples can be found on Jake VanderPlas's blog.
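For reference, drawing samples from such a kernel density estimate with scikit-learn could look roughly like this (a minimal sketch; the bandwidth value and the fake `measurements` array are placeholders you would replace with your own data and a properly selected bandwidth):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# measurements: your ~500,000 observations, here faked for illustration.
measurements = np.random.standard_normal(500_000)

# KernelDensity expects a 2-D array of shape (n_samples, n_features).
kde = KernelDensity(kernel="gaussian", bandwidth=0.1)
kde.fit(measurements.reshape(-1, 1))

# Drawing new samples is cheap once the estimator is fitted.
new_samples = kde.sample(n_samples=1000, random_state=0)
```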

SVM Matlab classification

I'm approaching a 4-class classification problem; it's not particularly unbalanced, there are no missing features, and there are a lot of observations. Everything seems fine, but when I approach the classification with fitcecoc it classifies everything as part of the first class. I tried using fitclinear and fitcsvm on one-vs-all decomposed data, but got the same results. Do you have any clue about the reason for this problem?
Here are a few recommendations:
Have you normalized your data? SVM is sensitive to features being on different scales.
Save the mean and standard deviation you obtain during training and use those values during the prediction phase to normalize the test samples (see the sketch below).
Change the C value and see if that changes the results.
I hope these help.
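A minimal sketch of that normalization step, in Python/scikit-learn rather than MATLAB just to show the idea (`X_train` and `X_test` are placeholder arrays):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for your feature matrices.
X_train = np.random.rand(100, 5)
X_test = np.random.rand(20, 5)

scaler = StandardScaler().fit(X_train)    # learns mean and std from the training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the same mean/std; do not refit on test data
```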

Data noise with PCA

I have a question related to data noise and principal component analysis (PCA).
Situation
I have a data matrix containing X, Y, Z joint data. I have applied PCA, with the stipulation of retaining 98% of the variance. However, even after reduction the data still remains very noisy.
Problem
I have spent a few hours reading and I'm unsure of the best approach to take. I need to perform PCA for dimensionality reduction, but the noise in the dataset still causes several issues, so I need an intermediate step before applying PCA to reduce that noise. I have been advised that Gaussian smoothing might be the best way forward before applying PCA.
Can anyone suggest the best approach to take?
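A minimal sketch of that Gaussian-smoothing-then-PCA step (in Python; the sigma value and the assumption that rows are time-ordered frames of X, Y, Z joint coordinates are illustrative assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from sklearn.decomposition import PCA

# Placeholder: n_frames x (n_joints * 3) matrix of X, Y, Z joint coordinates.
joints = np.random.rand(1000, 60)

# Smooth each coordinate along the time axis before reducing dimensionality.
smoothed = gaussian_filter1d(joints, sigma=2.0, axis=0)

# Keep the number of components needed to retain 98% of the variance.
pca = PCA(n_components=0.98)
projected = pca.fit_transform(smoothed)
print(projected.shape)
```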
Edit
Apologies for not being clear in my question.
Original data: (figure showing an example of the original data). Projected: (figure showing the projection with 98% of the variance retained).
There is still a little noise in the projection. At least 4 points are not uniform in their positioning.

Fourier spectral analysis with Support Vector Machines

I did some reading this afternoon about SVMs, and I have the hope that this approach is very promising.
I am currently working on a problem where I'm looking for a pattern in the Fourier spectrum. What I mean is that I have been looking at spectra for days, hoping to find some repeating patterns. I found some criteria that match a certain pattern, but with the next sample the whole pattern can look slightly different. So there is always a slight deviation, which makes the pattern hard to describe; or, put another way, I might be overlooking something. But I can clearly say which data belongs in the training set.
I was hoping to use an SVM, train it, and have it predict the classification. That is, if I have another set of new data, it would tell me whether it matches the training data or falls into the "other" group, which could be anything (no need to know what).
Is that something an SVM is able to do, or am I completely off? I couldn't find any good examples of input data to see whether my problem is something I could feed to an SVM.
Currently using Matlab.
There has actually been a great deal of research done on this particular topic, especially with the wavelet transform. Google "wavelet transform and SVM" and you will find a number of papers. From there, you can easily go ahead and adapt your model from the wavelet to the FFT spectrum.
I don't have experience with SVM, but I do have experience with related techniques, and here's what I can say:
In all likelihood, you can't simply go from a spectrum to SVM to decision. You need to determine what it is about the spectrums that distinguish your various inputs. For example, if it's the way the data changes over time or the relationship between the high and low frequencies that makes the inputs different, you need to encode that a single parameter. Eg, you could make a parameter that's the ratio of some of your higher frequencies to some of your lower frequencies. You may also want to use parameters like frequency centroid and zero-crossing rate, which are simpler than the spectrum, but may still carry useful information (These are used in audio and speech. not sure if they apply to whatever you are looking at). Once you have these derived parameters, feed them to the SVM analysis, which will do the sorting.
Other techniques you might want to examine (which also have the same requirements) include HMM (Hidden Markov Models), K-Means, and Logistic Regression.