I used optics.m function from http://chemometria.us.edu.pl/download/OPTICS.M to calculate optics algorithm in MATLAB. This function outputs RD and CD and Order vector of all points.
I used bar(RD(order)); code to display Reachability plot of them. But I want to index clusters of points and scatter them in MATLAB. How can i do that?
Don't use that code. It is slow, and I read somewhere here that it is even giving incorrect results?
But most of all, it lacks cluster extraction functionality; it is only half of OPTICS.
Related
I have a dataset which I need to cluster and display in a way wherein elements in the same cluster should appear closer together. The dataset is based out of a research study, and has around 16 rows(entries) and about 50 features. I do agree that its not an ideal dataset to begin with, but unfortunately thats is the situation on hand.
Following is the approach I took:
I first applied KMeans on the dataset after normalizing it.
In parallel I also tried to use TSNE to map the data into 2 dimensions and plotted them on a scatterplot. From my understanding of TSNE, that technique should already be placing items in same clusters closer to each other. When I look at the scatterplot, however, the clusters are really all over the place.
The result of the scatterplot can be found here: https://imgur.com/ZPhPjHB
Is this because TSNE and KMeans intrinsically work differently? Should I just do TSNE and try to label the clusters (and if so, how?) or should I be using TSNE output to feed into KMeans somehow?
I am really new in this space and advice would be greatly appreciated!
Thanks in advance once again
Edit: The same overlap happens if I first use TSNE to reduce dimensions to 2 and then use those reduced dimensions to cluster using KMeans
There is a difference between TSNE and KMeans. TSNE is used for visualization mostly and it tries to project points on the 2D/3D space (from bigger spaces) in order to keep distances (if in the bigger space 2 points were far away TSNE will try to show it).
So TSNE is not a real clustering. And that's why results you got that strange scatter plot.
For TSNE sometimes you need to apply PCA before but that is needed if your number of features is big. Just to speed-up calculations.
As already advised, try to use hierarchical clustering or simply generate more rows.
Apply tSNE and fit k-means is one of the basic things you can start from.
I would say consider using different f-divergence.
Stochastic Neighbor Embedding under f-divergences https://arxiv.org/pdf/1811.01247.pdf
This paper tries five different f- divergence functions : KL, RKL, JS, CH (Chi-Square), HL (Hellinger).
The paper goes over which divergence emphasize what in terms of precision and recall.
I am using Matlab 2015a.
I have got electricity consumption data to cluster it. Initially i am trying to cluster it against hours and dates. I have created three different variables, one for time, one for dates and third for data. I am unable to understand how should i combine these in a matrix form so that the loads are distributed according to time? Then i have tried to look how can i plot a line graph for k-means but i can only find scatter command graphs but no line plots.
Further how can i plot it as a 3-d plot?
Further at a later stage i want to include temperature variable aswel. But when the 4th variable is involved, what will the plot be? will it still be 3-d?
Any suggestions, links?
In Matlab you can create N-dimensional matrices, so you can arrange your 3D data in a N*M*3 matrix (you might want to look for the cat() function to help you out).
There are several functions that allow you to plot in 3D, one of these is scatter3() which is perfect for K-Means clustering. I don't really understand which lines you do want to plot: K-Means is about clusters and centroids (i.e. points).
If a 4th variable is involved, you can as well create a 4D matrix. Although I reckon plotting a 4D graph isn't going to be easy. A first approach might be using several colours for your scatter points with different colours for different temperatures (or temperatures range). In this case the 5th input argument for scatter3() will be helpful.
Help for scatter3() here.
Help for cat() here.
I have about 100 data points which mostly satisfying a certain function (but some points are off). I would like to plot all those points in a smooth curve but the problem is the points are not uniformly distributed. So is that anyway to get the smooth curve? I am thinking to interpolate some points in between, but the only way that comes up to my mind is to linearly insert some artificial points between two data points. But that will show a pretty weird shape (like some sharp corner). So any better idea? Thanks.
If you know more or less what the actual curve should be, you can try to fit that curve to your points (e.g. using polyfit). Depending on how many points are off and how far, you can get by with least squares regression (which is fairly easy to get working). If you have too many outliers (or they are much too large/small), you can also try robust regression (e.g. least absolute deviation fitting) using the robustfit function.
If you can manually determine the outliers, you can also fit a curve through the other points to get better results or even use interpolation methods (e.g. interp1 in MATLAB) on those points to get a smoother curve.
If you know which function describes your data, robust fitting (using, e.g. ROBUSTFIT, or the new convenient functions LINEARMODEL and NONLINEARMODEL with the robust option) is a good way to go if there are outliers in your data.
If you don't know the function that describes your data, but want a smooth trendline that is little affected by outliers, SMOOTHN from the File Exchange does an excellent job in my experience.
Have you looked at the use of smoothing splines? Like interpolating splines, but with the knot points and coefficients chosen to minimise a least-squares error function. There is an excellent implementation available from Matlab central which I have used successfully.
I running kmeans in matlab on a 400x1000 matrix and for some reason whenever I run the algorithm I get different results. Below is a code example:
[idx, ~, ~, ~] = kmeans(factor_matrix, 10, 'dist','sqeuclidean','replicates',20);
For some reason, each time I run this code I get different results? any ideas?
I am using it to identify multicollinearity issues.
Thanks for the help!
The k-means implementation in MATLAB has a randomized component: the selection of initial centers. This causes different outcomes. Practically however, MATLAB runs k-means a number of times and returns you the clustering with the lowest distortion. If you're seeing wildly different clusterings each time, it may mean that your data is not amenable to the kind of clusters (spherical) that k-means looks for, and is an indication toward trying other clustering algorithms (e.g. spectral ones).
You can get deterministic behavior by passing it an initial set of centers as one of the function arguments (the start parameter). This will give you the same output clustering each time. There are several heuristics to choose the initial set of centers (e.g. K-means++).
As you can read on the wiki, k-means algorithms are generally heuristic and partially probabilistic, the one in Matlab being no exception.
This means that there is a certain random part to the algorithm (in Matlab's case, repeatedly using random starting points to find the global solution). This makes kmeans output clusters that are of good-quality-on-average. But: given the pseudo-random nature of the algorithm, you will get slightly different clusters each time -- this is normal behavior.
This is called initialization problem, as kmeans starts with random iniinital points to cluster your data. matlab selects k random points and calculates the distance of points in your data to these locations and finds new centroids to further minimize the distance. so you might get different results for centroid locations, but the answer is similar.
I have some dynamic light scattering data. The machine pumps out the autocorrelation function, and a count-rate.
I can do a simple fit to the ACF
ACF = exp(-D*q^2*t)
and obtain the diffusion coefficient.
I want to obtain the same D from the power spectrum. I have been able to create a power spectrum in two ways -- from the Fourier transform of the ACF, and from the count rate. Both agree, but the power spectrum does not look like in the one in the books, so I'm not sure how to use it to work out the line width.
Attached is an image from a PDF that shows what you should get, and what I get from MATLAB. Can anyone make sense of whats going on?
I have used the code of answer #3 on this question. The resulting autocorrelation comes out exactly the same as
the machine gives me and
using MATLAB's autocorr command on the photoncount data.
Thank you for your time.
When you compute the Fourier transform from short sequences of data it often looks very noisy. There are a number of reasons for this. One reason is that the statistics of individual Fourier components are not Gaussian, and so averaging the spectra across multiple samples of data will only slowly improve the quality of the estimate.
Another causes of "noisiness" in empirical spectra behavior is that you are applying (to a finite data sample) a transform which involves a pathological sinc function and which assumes an infinite length signal. To diminish this problem, it helps to apply a "windowing-function" to your data before computing the Fourier transform. One of the more complicated but also more powerful windowing approaches is the use of so-called 'Slepian tapers'.
MATLAB conveniently implements well-known windows in functions such as hamming and hann.