Data noise with PCA - matlab

I have a question related to data noise and principal component analysis (PCA).
Situation
I have a data matrix containing X, Y, Z joint data. I have applied PCA, with the stipulation of retaining 98% of the variance. However, even after reduction the data remains very noisy.
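As a rough illustration of what "retaining 98% of the variance" means, here is a minimal Python sketch (not MATLAB) that picks the number of components from a list of eigenvalues. The function name and the eigenvalues are made up for the example. Note that any variance contributed by noise counts toward the 98% as well, which is one reason the projection can still look noisy.

```python
def components_for_variance(eigenvalues, threshold=0.98):
    """Smallest k such that the k largest eigenvalues explain >= threshold of the variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(ordered)

# Hypothetical eigenvalues of a covariance matrix (invented for illustration).
eigenvalues = [6.0, 3.0, 0.9, 0.07, 0.03]
k = components_for_variance(eigenvalues)  # cumulative ratios: 0.60, 0.90, 0.99, ...
```

With these numbers the first two components explain only 90%, so a third is kept.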
Problem
I have spent a few hours reading and I'm unsure of the best approach to take. I need to perform PCA for dimension reduction; however, the noise present in the dataset still causes several issues. I need an intermediate step before applying PCA to reduce the noise contained in the dataset. I have been advised that Gaussian smoothing might be the best way forward before applying PCA.
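For reference, Gaussian smoothing of each coordinate track over time could look roughly like the sketch below. It is written in plain Python rather than MATLAB, and it assumes rows are frames and columns are the X, Y, Z coordinates. The function names, the 3-sigma kernel radius and the edge handling (repeating the end samples) are my own choices, not something from the question.

```python
import math

def gaussian_kernel(sigma):
    """Normalised 1-D Gaussian weights covering about +/- 3 sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(signal, sigma=1.0):
    """Smooth one coordinate track; edges are handled by repeating the end samples."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for t in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(t + j - radius, 0), n - 1)  # clamp index to the valid range
            acc += w * signal[idx]
        out.append(acc)
    return out

def smooth_columns(frames, sigma=1.0):
    """Apply smooth() to every column of a frames-by-coordinates matrix."""
    columns = [smooth(list(col), sigma) for col in zip(*frames)]
    return [list(row) for row in zip(*columns)]
```

A larger sigma removes more noise but also blurs genuine fast movements, so it is worth checking the result visually before running PCA on it.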
Can anyone suggest the best approach to take?
Edit
Apologies for not being clear in my question.
Original data: Here is an example of the original data. Projected: with 98% of the variance retained.
There is still a little noise in the projection. At least 4 points are not uniform in their positioning.

Related

Why is it important to transform the data into normal / Gaussian distribution when creating a linear regression model

I'm currently building my first regression model. As I understand it, owing to the limitations of the algorithm, we need to remove outliers and transform the distribution into a normal one.
I know that it's important and I know the ways to do it, but can someone please help me understand why exactly we need to do so? Why can't I work with a highly skewed distribution? Why does linear regression mandate this transformation in the preprocessing stage?
Classifier with and without Outliers in the data
Hope the above picture clears your doubt.
A LinearRegression model is fitted by finding the line that minimizes the squared error. Because the errors are squared, outliers (abnormal data points or noise) can pull the fit away from the bulk of the data, so the model may work poorly on test data (more general data).
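To make the effect of a single outlier concrete, here is a small Python sketch of ordinary least squares on five points, fitted once on clean data and once with the last point corrupted. The data are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
clean = [1.0, 3.0, 5.0, 7.0, 9.0]           # exactly y = 2x + 1
with_outlier = [1.0, 3.0, 5.0, 7.0, 30.0]   # last point corrupted

slope_clean, intercept_clean = fit_line(xs, clean)        # slope 2, intercept 1
slope_bad, intercept_bad = fit_line(xs, with_outlier)     # slope about 6.2, intercept about -3.2
```

One bad point out of five roughly triples the slope, because its squared residual dominates the sum being minimized.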

Finding elongated clusters using MATLAB

Let me explain what I'm trying to do.
I have a plot of an image's pixels in RGB space.
What I am trying to do is find elongated clusters in this space. I'm fairly new to clustering techniques, so maybe I'm not doing things correctly. I'm trying to cluster using MATLAB's built-in k-means clustering, but it appears that this is not the best approach in this case.
What I need to do is find "color clusters".
This is what I get after applying K-means on an image.
This is how it should look:
for an image like this:
Can someone tell me where I'm going wrong, and what I can do to improve my results?
Note: Sorry for the low-res images, these are the best I have.
Are you trying to replicate the results of this paper? I would say just do what they did.
However, I will add an answer, since there are some issues with the current ones.
1) Yes, your clusters are not spherical, which is an assumption k-means makes. DBSCAN and MeanShift are two more common methods for handling such data, as they can handle non-spherical clusters. However, your data appears to have one large central clump that spreads outwards in a few finite directions.
For DBSCAN, this means it will likely put everything into one cluster or make every point its own cluster, because DBSCAN assumes roughly uniform density and requires that clusters be separated by some margin.
MeanShift will likely have difficulty because everything seems to be coming from one central lump - so that will be the area of highest density that the points will shift toward, and converge to one large cluster.
My advice would be to change color spaces. RGB has issues, and the assumptions most algorithms make will probably not hold up well under it. The clustering algorithm you should use will then likely change in the different feature space, but hopefully it will make the problem easier to handle.
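The answer does not say which color space to switch to. As one possible choice (my assumption, not the answer's), here is a Python sketch that converts RGB pixels to HSV with the standard-library `colorsys` module. The pixel values are made up.

```python
import colorsys

def rgb_pixels_to_hsv(pixels):
    """Convert (r, g, b) tuples in 0..255 to (h, s, v) tuples in 0..1."""
    return [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            for r, g, b in pixels]

# A bright red and a dark red pixel (hypothetical values).
bright_red, dark_red = rgb_pixels_to_hsv([(255, 0, 0), (128, 0, 0)])
```

Both pixels end up with the same hue and saturation and differ only in value. A streak of differently shaded pixels of one color in RGB can therefore become much more compact in the hue and saturation coordinates.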
k-means basically assumes clusters are approximately spherical. In your case they are definitely NOT. Try fitting a Gaussian with a non-spherical covariance matrix to each cluster.
Basically, you will be following the same expectation-maximization (EM) steps as in k-means with the only exception that you will be modeling and fitting the covariance matrix as well.
Here's an outline of the algorithm:
1) init: assign each point at random to one of k clusters.
2) For each cluster, estimate its mean and covariance.
3) For each point, estimate its likelihood of belonging to each cluster.
   Note that this likelihood is based not only on the distance to the center (mean) but also on the shape of the cluster, as encoded by the covariance matrix.
4) Repeat stages 2 and 3 until convergence or until a pre-defined number of iterations is exceeded.
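The outline above can be sketched in Python roughly as follows. This is a simplified version under several assumptions of mine: points are 2-D (the RGB data would be 3-D), each point is hard-assigned to its single most likely cluster, there are no mixing weights, and a small `ridge` term keeps each covariance invertible. All function names are hypothetical.

```python
import math
import random

def mean_and_cov(points, ridge=1e-6):
    """Mean and 2x2 covariance (sxx, sxy, syy) of a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n + ridge
    syy = sum((p[1] - my) ** 2 for p in points) / n + ridge
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), (sxx, sxy, syy)

def log_likelihood(point, mean, cov):
    """Log of the 2-D Gaussian density at point (constant term dropped)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    maha = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * (maha + math.log(det))

def cluster(points, k, iterations=50, seed=0):
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]        # stage 1: random init
    for _ in range(iterations):
        params = []
        for c in range(k):                             # stage 2: mean and covariance
            members = [p for p, l in zip(points, labels) if l == c]
            if len(members) < 2:                       # re-seed a collapsed cluster
                members = rng.sample(points, 2)
            params.append(mean_and_cov(members))
        new_labels = [max(range(k), key=lambda c: log_likelihood(p, *params[c]))
                      for p in points]                 # stage 3: most likely cluster
        if new_labels == labels:                       # stage 4: stop at convergence
            break
        labels = new_labels
    return labels

# Two hypothetical elongated clusters: one horizontal, one vertical.
points = [(float(i), 0.1 * (i % 3)) for i in range(20)] + \
         [(30.0 + 0.1 * (i % 3), float(i)) for i in range(20)]
labels = cluster(points, k=2)
```

Like k-means, this can converge to a poor local optimum from a bad random start, so in practice it is usually run from several seeds. A full Gaussian mixture model would use soft assignments and mixing weights instead of the hard assignment used here.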
Take a look at density-based clustering algorithms, such as DBSCAN and MeanShift. If you are doing this for segmentation, you might want to add pixel coordinates to your vectors.
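Adding pixel coordinates to the color vectors could look like the following Python sketch. The image is assumed to be a list of rows of (r, g, b) tuples, and `spatial_weight` is a hypothetical knob I added to control how much position counts relative to color.

```python
def pixel_features(image, spatial_weight=1.0):
    """Return one (r, g, b, x, y) vector per pixel of a row-major image."""
    features = []
    for y, row in enumerate(image):
        for x, (r, g, b) in enumerate(row):
            features.append((r, g, b, spatial_weight * x, spatial_weight * y))
    return features

# A tiny 2x2 image with made-up colors.
image = [[(255, 0, 0), (250, 5, 0)],
         [(0, 0, 255), (0, 5, 250)]]
features = pixel_features(image)
```

The resulting 5-D vectors can be passed to whichever clustering algorithm is used, so that pixels of similar color that are far apart in the image are less likely to be merged.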

Resampling data with minimal loss of information in time-domain

I am trying to resample/recreate already recorded data for plotting purposes. I thought this was the best place to ask the question (besides dsp.se).
The data is sampled at a high frequency and contains too many data points to be plotted in the time domain (not enough memory). I want to resample it with minimal loss. The sampling interval of the resulting data doesn't need to be uniform (again, it is for plotting purposes, not analysis), although the input data is equally sampled.
When we use the regular resample command from MATLAB/Octave, it can distort stiff pieces of the curve.
What is the best approach here?
(For reference, I put two pictures found on tex.se.)
The first image is a regular resample.
The second image shows better-resampled data that behaves well around peaks.
You should try this set of files from the File Exchange. It computes an optimal lookup table based on either a maximum number of points or a given error. You can choose from natural, linear, or spline for the interpolation method. Spline will have the smallest table size but is slower than linear. I don't use natural unless I have a really good reason.
Sincerely,
Jason
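Separately from the File Exchange files mentioned above (which are not reproduced here), one simple way to reduce points for plotting while keeping peaks is to keep the minimum and maximum of each block of samples. The Python sketch below shows that idea. It is a different technique from the lookup-table approach in the answer, and the function name and data are my own.

```python
def minmax_decimate(samples, bucket_size):
    """Keep the minimum and maximum of each bucket, in their original order.

    Returns (index, value) pairs, so the output spacing is not uniform.
    """
    kept = []
    for start in range(0, len(samples), bucket_size):
        bucket = samples[start:start + bucket_size]
        lo = min(range(len(bucket)), key=bucket.__getitem__) + start
        hi = max(range(len(bucket)), key=bucket.__getitem__) + start
        for idx in sorted({lo, hi}):
            kept.append((idx, samples[idx]))
    return kept

samples = [0, 0, 0, 9, 0, 0, 0, 0]      # a single narrow peak (made-up data)
naive = samples[::4]                    # plain every-4th-sample decimation
kept = minmax_decimate(samples, 4)
```

Here the plain decimation misses the peak at index 3 entirely, while the min/max version keeps it. Each bucket contributes at most two points, so the output is at most `2 / bucket_size` of the input length.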

Principal Component Analysis in practice

I understand the concept of PCA, and what it's doing, but trying to apply the concept to my application is proving difficult.
I have a 1 by X matrix of a physiological signal (it's not EMG, but very similar, so think of it as EMG if it helps) which contains various noise and artefacts. What I've noticed about the noise is that some of it is very large, and I would assume that after PCA it would be the largest principal component; hence my idea of using PCA for some dimensionality reduction.
My problem is that with a 1 by X matrix there is no covariance matrix, only the variance, and thus eigenvectors and all of PCA falls through.
I know I need to rearrange my data into a matrix more than 1D, but this is where I need some suggestions. Do I split my data into windows of equal length to create a large dimensional matrix which I can apply PCA to? Do I perform several trials of the same action so I have lots of data sets (this would be impractical for my application)?
Any suggestions or examples would be helpful. I'm using MATLAB to perform this task.
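To illustrate the windowing option raised in the question, here is a Python sketch (not MATLAB) that cuts a 1-D signal into equal-length windows and computes the covariance between window positions. Whether this arrangement suits the signal is an open question. The function names and the toy signal are made up.

```python
def window_matrix(signal, window, step=None):
    """Cut a 1-D signal into rows of length `window`; step < window gives overlap."""
    step = window if step is None else step
    return [signal[i:i + window]
            for i in range(0, len(signal) - window + 1, step)]

def covariance(rows):
    """Sample covariance between columns; each row is one observation (window)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(d)] for a in range(d)]

signal = list(range(10))                           # toy signal
rows = window_matrix(signal, window=4)             # non-overlapping windows
overlapping = window_matrix(signal, window=4, step=2)
cov = covariance(rows)                             # 4x4 matrix, one entry per pair of positions
```

The windowed matrix has one row per window, so it now has a proper covariance matrix whose eigenvectors can be computed (in MATLAB, for example with `eig` or `pca`). Trailing samples that do not fill a whole window are dropped.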

Analyzing data for noisy arrays

Using MATLAB, I filtered a very noisy m x n array with a low-pass Gaussian filter. This cleaned it up pretty well, but still not well enough to analyze my data. What would the next step be? I'm thinking of signal enhancement, but I'm not sure how to go about it.
Update
Well, there are actually two different types of data sets. One has small peaks, circular at the base and around half a dozen pixels wide there, on a noisy background with random noise. The other is the same thing, but with mainly Gaussian and Poisson noise. I tried filtering with a Gaussian low-pass filter in both instances, which worked to some extent, as mentioned in the OP.
It is impossible to answer this without knowing what data you have, and what the noise is like.
Different problems will have different best solutions.