PCA with correlated dimensions - matlab

I am trying to describe the limit cycle of the waveforms of the 'arms' of a swimming alga in terms of the shape scores of its principal components. So I have the shapes of the arms stored in terms of xy coordinates at nodes distributed along the arc length of the arm. I am trying to do a principal component analysis on this, but I am struggling a bit.
Before, I had the shapes described in terms of curvature along the arc length. Each curve had 25 nodes, so I got a nice 25x25 covariance matrix. The analysis was very straightforward, everything worked fine.
Now, for reasons irrelevant here, it is more convenient to have the curves described in terms of the x and y values of the nodes. That gives 25 nodes, each with an x and a y value, so 50 features per sample, although features 1:25 and 26:50 form 'sets'.
This can be viewed as a 3D array of n samples by m nodes by k coordinates (here k = 2), or simply as a 2D matrix of n samples by m*k features, where the x and y values are separate features.
Just chaining the x and y vectors and doing PCA on that did not really help me - I don't understand what I am doing anymore. I get the basics of PCA, but how to do this for a more complex data set is beyond me. Also, I am not too great at matrix algebra, so a more intuitive explanation is welcome.
The question: Am I doing entirely the wrong thing or is there some way to retrieve shape modes of 25 nodes with an x and y value?
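A minimal sketch of the "chained x and y" approach, in case it helps to see it end to end. The synthetic shapes, the variable names and the use of svd are all assumptions made for illustration, not the asker's data or code. The point it tries to show is that each principal component of the n-by-50 matrix is itself a 50-vector, whose first 25 entries are the x part and last 25 entries the y part of one shape mode.

```matlab
% Synthetic stand-in for the real data (assumption): n shapes, m nodes each.
n = 200; m = 25;
s = linspace(0, 1, m);                 % arc-length parameter of the nodes
phase = 2*pi*rand(n, 1);               % position along the beat cycle
X = repmat(s, n, 1);                   % x coordinates, n-by-m
Y = 0.10*sin(phase)*sin(pi*s) + 0.05*cos(phase)*sin(2*pi*s);   % y coordinates, n-by-m

D  = [X, Y];                           % n-by-2m: features 1:m are x, m+1:2m are y
mu = mean(D, 1);                       % mean shape
Dc = D - mu;                           % centred data (implicit expansion, R2016b+)

[U, S, V] = svd(Dc, 'econ');           % columns of V are the principal components
scores    = U*S;                       % shape scores, one row per sample
explained = diag(S).^2 / sum(diag(S).^2);

mode1 = reshape(V(:, 1), m, 2);        % column 1: x part, column 2: y part of mode 1
```

Plotting mu(1:m) + a*mode1(:,1).' against mu(m+1:end) + a*mode1(:,2).' for a few values of a should show what the first mode does to the mean shape, and plotting scores(:,1) against scores(:,2) should trace out the limit cycle.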

Related

How to apply a moving median filter on a time series of 2D scans in Matlab?

I have a huge set of data of a timelapse of 2D laser scans of waves running up and down stairs (see fig. 1, fig. 2, fig. 3).
There is a lot of noise in the scans, since the water splashes a lot.
Now I want to smooth the scans.
I have 2 questions:
How do I apply a moving median filter (as recommended by another study dealing with a similar problem)? I can only find instructions for single e.g. (x,y) or (t,y) plots but not for x and y values that vary over time. Maybe an average filter would do it as well, but I do not have a clue on that either.
The scanner is at a fixed point (222m) so all the data spikes point towards that point at the ceiling. Is it possible or necessary to include this into the smoothing process?
This is the part of the code (I hope it's enough to get it):
% Plot data as a real-time profile
x1 = data.x;
y1 = data.y;
t = data.t;
% add moving median filter here?
h1 = plot(x1(1,:), y1(1,:));
axis([210 235 3 9])
ht = title('Scanner data');
for i = 1:length(t)
    set(h1, 'XData', x1(i,:), 'YData', y1(i,:));
    set(ht, 'String', sprintf('t = %5.2f s', data.t(i)));
    pause(.01);
end
The data.x values are stored in a (mxn) matrix in which the change in time is arranged vertically and the x values i.e. "laser points" of the scanner are horizontally arranged. The data.y is stored in the same way. The data.t values are stored in a (mx1) matrix.
I hope I explained everything clearly and that somebody can help me. I am already pretty desperate about it... If there is anything missing or confusing, please let me know.
If you're trying to apply a median filter in the x-y plane, then consider using medfilt2 from the Image Processing Toolbox. Note that this function only accepts 2-D inputs, so you'll have to loop over the third dimension.
Also note that medfilt2 assumes that the x and y data are uniformly spaced, so if your x and y data don't fall onto a uniformly spaced grid you may have to manually loop over indices, extract the corresponding patches, and compute the median.
If you can/want to apply an averaging filter instead of a median filter, and if you have uniformly spaced data, then you can use convn to compute a k x k moving average by doing:
y = convn(x, ones(k,k)/(k*k), 'same');
Note that you'll get some bias on the boundaries, because convn pads with zeros: near the edges fewer than k^2 real values fall inside the window, but the sum is still divided by k^2.
Alternatively, you can use nested calls to movmean since the averaging operation is separable:
y = movmean(movmean(x, k, 2), k, 1);
If your grid is separable, but not uniform, you can still use movmean, just use the SamplePoints name-value pair:
y = movmean(movmean(x, k, 2, 'SamplePoints', yv), k, 1, 'SamplePoints', xv);
You can also control the endpoint handling in movmean with the Endpoints name-value pair.
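For the moving median that was originally asked about, movmedian is the sibling of movmean (available since R2016a) and takes the same window and dimension arguments. The sketch below is only an illustration under assumptions: the toy matrix stands in for data.y (rows are time steps, columns are laser points), and the window length k = 3 is an arbitrary choice that would need tuning.

```matlab
% Toy stand-in for data.y (assumption): rows are time steps, columns are laser points.
y1 = ones(5, 3);
y1(2, 2) = 100;                 % a single splash spike at time step 2
k = 3;                          % window length in time steps (tuning assumption)
y1s = movmedian(y1, k, 1);      % median over time (dimension 1), per laser point
```

Using dimension 2 instead would take the median across neighbouring laser points within one scan, and the two calls can be nested in the same way as the movmean example.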

Clustering algorithm with different epsilons on different axes

I am looking for a clustering algorithm such as DBSCAN to deal with 3D data, in which it is possible to set different epsilons depending on the axis. So for instance an epsilon of 10 m in the x-y plane, and an epsilon of 0.2 m on the z axis.
Essentially, I am looking for large but flat clusters.
Note: I am an archaeologist. The algorithm will be used to look for potential correlations between objects scattered over large surfaces, but in narrow vertical layers.
Solution 1:
Scale your data set to match your desired epsilon.
In your case, scale z by 50.
Solution 2:
Use a weighted distance function.
E.g. WeightedEuclideanDistanceFunction in ELKI, and choose your weights accordingly, e.g. -distance.weights 1,1,50 will put 50x as much weight on the third axis.
This may be the most convenient option, since you are already using ELKI.
Just define a custom distance metric when computing the DBSCAN core points. The standard DBSCAN uses the Euclidean distance to compute points within an epsilon. So all dimensions are treated the same.
However, you could use the Mahalanobis distance to weigh each dimension differently. You can use a diagonal covariance matrix for flat clusters. You can use a full symmetric covariance matrix for flat tilted clusters, etc.
In your case, you would use a covariance matrix like:
100   0    0
  0 100    0
  0   0 0.04
In the pseudo code provided at the Wikipedia entry for DBSCAN just use one of the distance metrics suggested above in the regionQuery function.
Update
Note: scaling the data is equivalent to using an appropriate metric.
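A small MATLAB sketch of the scaling idea, under stated assumptions: the four points are made up, and the neighbourhood test is written out by hand so that it does not depend on any toolbox. After multiplying z by 10/0.2 = 50, a single Euclidean epsilon of 10 behaves like an ellipsoid with radius 10 m in x-y and 0.2 m in z.

```matlab
X = [0 0 0; 8 0 0; 0 0 0.15; 0 0 0.5];     % toy points in metres (assumption)
epsXY = 10;                                % desired epsilon in the x-y plane
epsZ  = 0.2;                               % desired epsilon along z

Xs = X;
Xs(:, 3) = X(:, 3) * (epsXY / epsZ);       % scale z by 50

% Pairwise Euclidean distances on the scaled data (implicit expansion, R2016b+).
D = sqrt(sum((permute(Xs, [1 3 2]) - permute(Xs, [3 1 2])).^2, 3));
isNeighbor = D <= epsXY;                   % the epsilon-neighbourhood DBSCAN would use

% With the Statistics and Machine Learning Toolbox (R2019a or later), the
% scaled data could be passed on directly, with minPts chosen by the user:
% labels = dbscan(Xs, epsXY, minPts);
```

Here point 3 (0.15 m above point 1) counts as a neighbour of point 1, while point 4 (0.5 m above) does not, even though both are far closer than 10 m in the unscaled data.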

How can I generate a set of n dimensional vectors that contains all integer points in an n-dimensional rectangular prism

Okay, so I'm working on a problem related to quantum chaos and one of the things I need to do is to map the unit cube in n-dimensions to a parallelepiped in n-dimensions and find all integer points in the interior of this parallelepiped. I have been trying to do this using the following scheme:
1. Given the linear map B and the dimension of the cube n, we find the coordinates of the corners of the unit hypercube by converting numbers j from 0 to (2^n - 1) into their binary representation and turning them into vectors that describe the vertices of the cube.
2. The next step is to apply the map B to each of these vectors, which gives me a set of 2^n vectors describing the coordinates of the vertices of the parallelepiped in n dimensions.
3. Now, we take the maximum and minimum value attained by any of these vertices in each coordinate direction, i.e. the first element of my vectors might have a maximum value of 4 across all of the vertices and a minimum value of -3, etc. This gives me an n-dimensional rectangular prism that contains my parallelepiped and some extra unwanted space.
4. I now find all points with integer coordinates in this bounding rectangular prism, described as vectors in n dimensions.
5. Finally, I apply the inverse of the map B to each of the points and throw away any points that have any coefficient greater than 1 or less than 0, as they must originally have lain outside my unit hypercube.
My issue arises in step 4: I'm struggling to come up with a way of generating all vectors with integer coordinates in my rectangular hyper-prism such that I can change the number of dimensions n on the fly. Ideally, I'd like to be able to increase n at will until it becomes too computationally heavy to do so, but every method of finding all integer points in the prism I've tried so far has relied on n nested for loops to permute each element, and thus I need to rewrite the code every time.
So I guess my question is this: is there any way to code this up so that I can change n on the fly? Also, any thoughts on the idea of the algorithm itself would be appreciated :) It wouldn't surprise me if I've overcomplicated things massively...
EDIT:
Of course as soon as I post the question I see a lovely little link in the side-bar where a clever method has been given already for how to do this: Generate a matrix containing all combinations of elements taken from n vectors
I'll leave this up for the moment just in case anyone has any comments on the method in general, but otherwise (since I can't upvote yet I'll just say it here) Luis Mendo, you are a hero!
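For reference, a minimal sketch of the five steps with n left as a free parameter. It is not the code from the linked answer; it is one common way to do it, using ndgrid with comma-separated lists from cell arrays. The example map B is an assumption chosen so that the numbers are easy to check by hand.

```matlab
B = [2 1; 0 2];                          % example map (assumption); columns are the edge vectors
n = size(B, 1);

corners = dec2bin(0:2^n - 1) - '0';      % step 1: 2^n-by-n, rows are the cube vertices
V  = corners * B.';                      % step 2: mapped vertices, one per row
lo = ceil(min(V, [], 1));                % step 3: bounding box, rounded inwards
hi = floor(max(V, [], 1));

% Step 4: all integer points in the box, for any n, without nested loops.
ranges = arrayfun(@(a, b) a:b, lo, hi, 'UniformOutput', false);
grids  = cell(1, n);
[grids{:}] = ndgrid(ranges{:});
P = cell2mat(cellfun(@(g) g(:), grids, 'UniformOutput', false));   % one point per row

% Step 5: map back and keep points whose coefficients all lie in [0, 1].
C = (B \ P.').';
keep = all(C >= 0 & C <= 1, 2);          % use > and < instead for the strict interior
inside = P(keep, :);
```

For this B the bounding box holds 12 integer points and 8 of them survive the filter. For a general B it is probably safer to compare against a small tolerance rather than exactly 0 and 1, since B \ P is computed in floating point.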

Finding defined peaks with Clusters in MATLAB

This is my problem:
I have the following data "A", which looks like:
As you can see, I have marked the apparent peaks with red circles. The best defined are 2 and 7; I call them defined because their standard deviation is low compared with the other peaks (especially the second one).
What I need is a way (any way) to get the values and the standard deviations of n peaks in a numeric array.
I have tried clustering, but I got no good results:
First of all, I used the MATLAB function kmeans, and I realized that this algorithm doesn't group the peaks as I need. As you can see in the picture above, the cluster in the red circle contains at least 3 or 4 peaks. Also, kmeans needs the number of clusters to be set in advance, and I need to identify it automatically.
I hope that someone can give me some ideas, or a way to get better results. Thanks.
P.S.: The data "A" is available at the following link.
https://drive.google.com/file/d/0B4WGV21GqSL5a2EyQ2l0SHZURzA/edit?usp=sharing
The problem is that your axes have very different meaning.
K-means optimizes variance. But variance in X is something entirely different than variance in Y, isn't it? Furthermore, each of these methods will split your data in both X and Y, whereas I assume you want the data to be partitioned on the X axis only.
I suggest the following: consider the Y axis to be a weight, and X axis to be a position.
Then perform weighted density estimation, and look for low density to separate your clusters.
I can't help you with MATLAB. I don't use it.
Mathematically, what you want to do is place a Gaussian at each point, with area Y and center X. Then find minima and maxima on the sum of these Gaussians. See Wikipedia, Kernel Density Estimation for details; except that you want to use the Y axis as weights. You could maybe also use 1/Y as standard deviation, if you don't want to use weights.
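Since the answer above gives no MATLAB, here is a rough sketch of the weighted sum of Gaussians it describes. Everything concrete in it is an assumption: the four toy points, the bandwidth h and the grid resolution would all have to be replaced or tuned for the real data "A".

```matlab
x = [1.0 1.1 5.0 5.1];          % positions (toy data, assumption)
y = [2 2 1 1];                  % heights, used as weights
h = 0.3;                        % kernel bandwidth (tuning assumption)

g = linspace(min(x) - 3*h, max(x) + 3*h, 500);          % evaluation grid
K = exp(-0.5*((g - x(:))/h).^2) / (h*sqrt(2*pi));       % one unit-area Gaussian per row
dens = sum(y(:) .* K, 1);                               % weighted sum, area y(i) each

% Local maxima are peak candidates, local minima separate the clusters.
mid = dens(2:end-1);
isMax = mid > dens(1:end-2) & mid >= dens(3:end);
isMin = mid < dens(1:end-2) & mid <= dens(3:end);
peaks   = g([false, isMax, false]);
valleys = g([false, isMin, false]);
```

The valleys can then be used as cut points on the X axis, and a weighted mean and standard deviation can be computed for the points between each pair of cuts. The number of peaks found depends strongly on h, so it is worth trying a few values.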

Using triplequad to calculate density (in Matlab)

As I've explained in a previous question: I have a dataset consisting of a large semi-random collection of points in three-dimensional Euclidean space. In this collection of points, I am trying to find the point that is closest to the area with the highest density of points.
As High Performance Mark answered:
the most straightforward thing to do would be to divide your subset of
Euclidean space into lots of little unit volumes (voxels) and count
how many points there are in each one. The voxel with the most points
is where the density of points is at its highest. Perhaps initially
dividing your space into 2 x 2 x 2 voxels, then choosing the voxel
with most points and sub-dividing that in turn until your criteria are
satisfied.
Mark suggested I use triplequad for this, but this is not a function I am familiar with or understand very well. Does anyone have any pointers on how I could go about using this function in Matlab for what I am trying to do?
For example, say I have a random normally distributed matrix A = randn([300,300,300]); how could I use triplequad to find the point I am looking for? As I currently understand it, I also have to provide triplequad with a function fun when using it. Which function should that be for this problem?
Here's an answer which doesn't use triplequad.
For the purposes of exposition I define an array of data like this:
A = rand([30,3])*10;
which gives me 30 points uniformly distributed in the box (0:10,0:10,0:10). Note that in this explanation a point in 3D space is represented by each row in A. Now define a 3D array for the counts of points in each voxel:
counts = zeros(10,10,10);
Here I've chosen to have a 10x10x10 array of voxels, but this is just for convenience, it would be only a little more difficult to have chosen some other number of voxels in each dimension, and there don't have to be the same number of voxels along each axis. Then the code
for ix = 1:size(A,1)
    v = ceil(A(ix,:));   % voxel subscripts of point ix
    counts(v(1),v(2),v(3)) = counts(v(1),v(2),v(3)) + 1;   % semicolon suppresses the printout
end
will count up the number of points in each of the voxels in counts.
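One possible way to finish the job from here, sketched under assumptions: it uses accumarray as a vectorised replacement for the loop, takes the centre of the fullest voxel as "the area with the highest density", and picks the data point nearest to that centre. The voxel size of 1 and the 10x10x10 grid are carried over from the example above; ties between equally full voxels are resolved by taking the first one.

```matlab
A = rand(30, 3) * 10;                          % 30 points in the box, one per row
sub = ceil(A);                                 % voxel subscripts of every point
counts = accumarray(sub, 1, [10 10 10]);       % same counts as the loop, in one call

[maxCount, idx] = max(counts(:));              % fullest voxel (first one on ties)
[vi, vj, vk] = ind2sub(size(counts), idx);
center = [vi vj vk] - 0.5;                     % centre of that unit voxel

% Data point closest to the centre (implicit expansion, R2016b+).
[~, nearest] = min(sum((A - center).^2, 2));
closestPoint = A(nearest, :);
```

With only 30 points in 1000 voxels most counts are 0 or 1, so for sparse data a coarser grid, or the repeated subdivision described in the quoted answer, is likely to give a more meaningful result.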
EDIT
Unfortunately I have to do some work this afternoon and won't be able to get back to wrestling with the triplequad solution until later. Hope this is OK in the meantime.