How can I weight the dimensions when computing the KNN of a given instance vector in MATLAB?

Suppose I have a set of instances and I want to find the K instances closest to a particular instance. Moreover, I have weights describing the strength of each dimension when computing the distances. How can I incorporate these weights into the KNN search in MATLAB?

There are two methods that allow you to do this. Looking at the knnsearch documentation, the first is to use the 'seuclidean' flag, which computes the standardized Euclidean distance: each coordinate difference between two points is divided by a corresponding scale value in a vector S. By default S is the standard deviation of each coordinate, but you can specify your own scales through the Scale parameter by supplying a vector with one scale value per dimension.
Because each difference is divided by its scale, a smaller scale value makes a coordinate contribute more to the distance and a larger scale value makes it contribute less. Concretely, setting S(j) = 1/sqrt(w(j)) makes the squared difference in dimension j enter the distance with weight w(j), which is exactly the per-dimension weighting you are after.
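A minimal sketch of that first approach (the data sizes and weights here are only for illustration):

% 100 reference instances with 3 features, one query instance (example sizes)
X = rand(100, 3);
q = rand(1, 3);
w = [1 4 0.25];                    % assumed per-dimension weights
S = 1 ./ sqrt(w);                  % smaller scale = stronger dimension
idx = knnsearch(X, q, 'K', 5, 'Distance', 'seuclidean', 'Scale', S);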
Alternatively, you can provide your own function that computes the distance between vectors. Define the weights in your workspace beforehand, then create an anonymous function wrapper that captures those weights while computing whatever distance measure you want. The anonymous function only receives the coordinate inputs that knnsearch passes to a custom distance function (a single query row and a matrix of candidate rows), so it has to pick up the weights from the workspace rather than take them as an extra argument, then go from there.
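A hedged sketch of that second approach, again with illustrative sizes and weights:

% Weighted Euclidean distance via an anonymous function that captures w.
% zi is a 1-by-n query row, zj is an m-by-n matrix of candidate rows, and
% the handle must return an m-by-1 vector of distances.
X = rand(100, 3);
q = rand(1, 3);
w = [1 4 0.25];                                            % assumed weights
wdist = @(zi, zj) sqrt(bsxfun(@minus, zi, zj).^2 * w(:));
idx = knnsearch(X, q, 'K', 5, 'Distance', wdist);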
Check out: http://www.mathworks.com/help/stats/knnsearch.html

Related

Clustering algorithm with different epsilons on different axes

I am looking for a clustering algorithm such as DBSCAN to deal with 3D data, in which it is possible to set different epsilons depending on the axis. So, for instance, an epsilon of 10 m on the x-y plane, and an epsilon of 0.2 m on the z axis.
Essentially, I am looking for large but flat clusters.
Note: I am an archaeologist; the algorithm will be used to look for potential correlations between objects scattered across large surfaces but within narrow vertical layers.
Solution 1:
Scale your data set to match your desired epsilon.
In your case, scale z by 50.
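A minimal MATLAB sketch of Solution 1 (assuming the points are stored as an N-by-3 matrix data = [x y z], and that a dbscan function is available in your toolbox version):

data_scaled = data;
data_scaled(:,3) = data_scaled(:,3) * 50;   % 10 / 0.2 = 50
% A single epsilon of 10 now applies to all three axes:
labels = dbscan(data_scaled, 10, 5);        % minpts = 5 is just an example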
Solution 2:
Use a weighted distance function.
E.g. WeightedEuclideanDistanceFunction in ELKI, and choose your weights accordingly, e.g. -distance.weights 1,1,50 will put 50x as much weight on the third axis.
This may be the most convenient option, since you are already using ELKI.
Just define a custom distance metric when computing the DBSCAN core points. Standard DBSCAN uses the Euclidean distance to find the points within epsilon of a given point, so all dimensions are treated the same.
However, you could use the Mahalanobis distance to weight each dimension differently: a diagonal covariance matrix gives flat clusters, a full symmetric covariance matrix gives flat, tilted clusters, and so on.
In your case, you would use a covariance matrix like:
100    0    0
  0  100    0
  0    0    0.04
In the pseudocode provided at the Wikipedia entry for DBSCAN, just use one of the distance metrics suggested above in the regionQuery function.
Update
Note: scaling the data is equivalent to using an appropriate metric.
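A rough MATLAB sketch of that regionQuery idea (the function and variable names are assumptions, not part of any particular DBSCAN implementation; S is the covariance matrix shown above):

% data is N-by-3, p is the index of the query point, epsilon is the radius
% in Mahalanobis units (epsilon = 1 matches 10 m in x-y and 0.2 m in z for
% the covariance above), and S is the 3-by-3 covariance matrix.
function neighbors = regionQuery(data, p, epsilon, S)
    diffs = bsxfun(@minus, data, data(p, :));   % offsets from point p
    d2 = sum((diffs / S) .* diffs, 2);          % squared Mahalanobis distances
    neighbors = find(sqrt(d2) <= epsilon);
end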

Which scaling technique does it use?

I have a matrix X of size 100-by-2000 (double). I want to know what kind of scaling technique is applied to X in the following command, and why it does not use the z-score for scaling:
X = X./repmat(sqrt(sum(X.^2)),size(X,1),1);
That scaling comes from linear algebra: it is normalization to unit vectors. Assuming each row is an observation and each column is a feature, the command normalizes every feature over all observations so that the overall length / magnitude of each feature across all observations is set to 1.
The denominator, sqrt(sum(X.^2)), computes the norm (magnitude) of each feature over all observations; each entry of X is then divided by the magnitude of its own feature.
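An equivalent, more explicit way to write the same operation (vecnorm and normalize are only available in newer MATLAB releases, which is an assumption about your version; the repmat form works everywhere):

col_norms = sqrt(sum(X.^2, 1));            % 1-by-2000 vector of column magnitudes
X_unit = bsxfun(@rdivide, X, col_norms);   % every column now has unit 2-norm
% In newer releases the same thing can be written as
% X_unit = X ./ vecnorm(X);                % implicit expansion (R2016b+), vecnorm (R2017b+)
% X_unit = normalize(X, 'norm');           % R2018a+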
Unit vectors are often employed to describe a point in feature space with respect to a set of basis vectors. Normalizing to unit vectors gives you the smallest possible way to represent one component in feature space, so what is probably happening here is that the observations are being transformed so that each component / feature is represented in terms of a set of basis vectors, with each basis vector being one feature in the data.
Check out the Wikipedia article on Unit Vectors for more details: http://en.wikipedia.org/wiki/Unit_vector

Clustering with a Distance Matrix via Mahalanobis distance

I have a set of pairwise distances (in a matrix) between objects that I would like to cluster. I currently use k-means clustering (computing distance from the centroid as the average distance to all members of the given cluster, since I do not have coordinates), with k chosen by the best Davies-Bouldin index over an interval.
However, I have three separate metrics (potentially more in the future) describing the difference between the data points, each fairly different in magnitude and spread. Currently, I compute the distance matrix with the Euclidean distance across the three metrics, but I am fairly certain that the difference in scale between the metrics is distorting the result (e.g. the largest one overpowers the others).
I thought a good way to deal with this is to use the Mahalanobis distance to combine the metrics. However, I obviously cannot compute the covariance matrix between the coordinates, but I can compute it for the distance metrics. Does this make sense? That is, if I get the distance between two objects i and j as:
$D(i,j) = \sqrt{d^{T} S^{-1} d}$
where $d$ is the 3-vector of the different distance metrics between $i$ and $j$, $d^{T}$ is its transpose, and $S$ is the covariance matrix of the distances, would $D$ be a good, normalized metric for clustering?
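A minimal MATLAB sketch of that formula (the variable names are assumptions; dvecs holds the three raw distance metrics for each object pair, one pair per row):

% dvecs : M-by-3, one row per object pair, columns = the three metrics
S = cov(dvecs);                           % covariance of the distance metrics
D = sqrt(sum((dvecs / S) .* dvecs, 2));   % combined Mahalanobis-style distance per pair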
I have also thought of normalizing the metrics (i.e. subtracting the mean and dividing by the standard deviation) and then simply staying with the Euclidean distance (in fact this essentially is the Mahalanobis distance, at least in some cases), or of switching to something like DBSCAN or EM, and I have not ruled them out (though MDS followed by clustering might be a bit excessive). As a side note, any packages able to do all of this would be greatly appreciated. Thanks!
Consider using k-medoids (PAM) instead of a hacked k-means; it can work with arbitrary distance functions, whereas k-means is designed to minimize variances, not arbitrary distances.
EM will have the same problem: it needs to be able to compute meaningful centers.
You can also use hierarchical linkage clustering. It only needs a distance matrix.
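In MATLAB, a minimal sketch of that last option, assuming D is your precomputed N-by-N symmetric distance matrix with a zero diagonal:

y = squareform(D);               % condensed pdist-style distance vector
Z = linkage(y, 'average');       % average-linkage hierarchy
T = cluster(Z, 'maxclust', 5);   % cut into 5 clusters (k = 5 is just an example)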

Creating a 1D second derivative of Gaussian window

In MATLAB I need to generate a second derivative of a Gaussian window to apply to a vector representing the height of a curve. I need the second derivative in order to determine the locations of the inflection points and maxima along the curve. The vector representing the curve may be quite noisy, hence the use of the Gaussian window.
What is the best way to generate this window?
Is it best to use the gausswin function to generate the Gaussian window and then take the second derivative of that?
Or to generate the window manually using the equation for the second derivative of the Gaussian?
Or is it best to apply the Gaussian window to the data and then take the second derivative of it all? (I know these last two are mathematically the same, but with discrete data points I do not know which will be more accurate.)
The maximum length of the height vector is going to be around 100-200 elements.
Thanks
Chris
I would create a linear filter composed of the weights generated by the second derivative of a Gaussian function and convolve this with your vector.
The weights of a second derivative of a Gaussian are given by:
$G''(t) = C \, \frac{(t-\tau)^2 - \sigma^2}{\sigma^4} \, \exp\!\left(-\frac{(t-\tau)^2}{2\sigma^2}\right)$
where:
tau is the time shift for the filter. If you are generating weights for a discrete filter of length T with an odd number of samples, set tau to zero and let t vary over [-T/2, T/2].
sigma varies the scale of your operator. Setting sigma to roughly T/6 works well; if you are concerned about a long filter length, this ratio can be tightened towards T/4 (a shorter filter relative to sigma, at the cost of truncating the Gaussian's tails more aggressively).
C is the normalizing factor. This can be derived algebraically, but in practice I always compute it numerically after calculating the filter weights. For unity gain when smoothing periodic signals, I set C = 1 / sum(G'').
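A minimal MATLAB sketch of building and applying such a filter (the length, scale, and normalization choice are assumptions for illustration; y is your height vector):

T = 31;                                   % odd filter length (example)
sigma = T / 6;                            % scale, following the rule of thumb above
t = -floor(T/2):floor(T/2);               % tau = 0, t spans [-T/2, T/2]
G2 = ((t.^2 - sigma^2) ./ sigma^4) .* exp(-t.^2 ./ (2 * sigma^2));
G2 = G2 / sum(abs(G2));                   % one possible numeric normalization (assumption)
y_dd = conv(y, G2, 'same');               % smoothed second derivative of the curve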
Regarding your comment on the equivalence of smoothing first and taking the derivative afterwards, I would say it is more involved than that: which derivative operator would you use in the second step? A simple central difference would not yield the same results.
You can also get an approximately equivalent response to a second derivative of a Gaussian by filtering the data with two Gaussians of different scales and then taking the point-wise difference between the two resulting vectors. See Difference of Gaussians for that approach.
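A rough sketch of that difference-of-Gaussians idea (the support and scale ratio are assumptions; y is the same height vector as above):

t = -15:15;                                        % filter support (example)
sigma1 = 3;  sigma2 = 1.6 * sigma1;                % two nearby scales (assumed ratio)
g1 = exp(-t.^2 / (2 * sigma1^2));  g1 = g1 / sum(g1);
g2 = exp(-t.^2 / (2 * sigma2^2));  g2 = g2 / sum(g2);
y_dog = conv(y, g1 - g2, 'same');                  % approximates the second-derivative response up to sign and scale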

Controlled random number/dataset generation in MATLAB

Say I have a cube of dimensions 1x1x1 spanning from (0,0,0) to (1,1,1). I want to generate a random set of points (assume 10 points) within this cube which are somewhat uniformly distributed (i.e. within a certain minimum and maximum distance from each other, and also not too close to the boundaries). How do I go about this without using loops? If this is not possible using vector/matrix operations, then a solution with loops will also do.
Let me provide some more background details about my problem (This will help in terms of what I exactly need and why). I want to integrate a function, F(x,y,z), inside a polyhedron. I want to do it numerically as follows:
$\int_V F(x,y,z)\,dV \approx \sum_{i} F(x_i,y_i,z_i) \, V_i$
Here, $F(x_i,y_i,z_i)$ is the value of the function at the point $(x_i,y_i,z_i)$ and $V_i$ is the corresponding weight. So to calculate the integral accurately, I need to identify a set of random points which are neither too close to each other nor too far from each other (sorry, but I don't know what this range is myself; I will only be able to figure it out through a parametric study once I have working code). Also, I need to do this for a 3D mesh with multiple polyhedra, hence I want to avoid loops to speed things up.
Check out this nice random vectors generator with fixed sum FEX file.
The code "generates m random n-element column vectors of values, [x1;x2;...;xn], each with a fixed sum, s, and subject to a restriction a<=xi<=b. The vectors are randomly and uniformly distributed in the n-1 dimensional space of solutions. This is accomplished by decomposing that space into a number of different types of simplexes (the many-dimensional generalizations of line segments, triangles, and tetrahedra.) The 'rand' function is used to distribute vectors within each simplex uniformly, and further calls on 'rand' serve to select different types of simplexes with probabilities proportional to their respective n-1 dimensional volumes. This algorithm does not perform any rejection of solutions - all are generated so as to already fit within the prescribed hypercube."
Use P = rand(3,10) (using P rather than i, since i shadows the imaginary unit), where each column corresponds to one point and each row corresponds to the coordinate along one axis (x, y, z).
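Note that rand on its own ignores the minimum-separation and boundary constraints from the question. A hedged rejection-sampling sketch (the margin and minimum-distance values are placeholders, and it uses a loop, which the question allows as a fallback; it may loop for a long time if dmin is too large for n points):

n = 10;  margin = 0.05;  dmin = 0.2;              % assumed parameters
P = zeros(3, 0);
while size(P, 2) < n
    c = margin + (1 - 2*margin) * rand(3, 1);     % candidate away from the faces
    if all(sqrt(sum(bsxfun(@minus, P, c).^2, 1)) > dmin)   % all([]) is true for the first point
        P = [P, c];                               %#ok<AGROW>
    end
end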