K-medioids with Dynamic Time Warping in RapidMiner - cluster-analysis

How to perform K-medioids clustering with Dynamic Time Warping as a distance measure in RapidMiner?
The idea with Dynamic Time Warping is to perform it on time series of different length. How can I do that in RapidMiner? I get this error message
The data contains missing values which is not allowed for KMediods
How can I cluster time series of different length?

You could fill the missing values with zeroes. The operator Replace Missing Values does this. I don't know the details of your data nor how RapidMiner calculates DTW distances so I therefore can't tell if this approach would yield valid results.
Faced with this, I might use the R extension with the dtw and cluster packages to investigate how distances between different length time series could be used to make clusters. Once you have R working, you can call it from RapidMiner.

Related

Measuring curve “closeness” with unequal data ranges

Provided that I have a similar example:
where the blue data is my calculated/measured data and my red data is the given groundtruth data. The task is to get the similarity/closeness between the data and each of the given curves so that a classification can be done, it could also be possible to choose multiple classes if the results seem to be very close.
I can divide the problem in my mind to several subproblems:
The data range is no the same
The resolution of the calculated/measured data is higher than the ground-truth data
The calculated data has some bias/shift
The following questions come to my mind when trying to solve those problems
Is it better to fit the calculated/measured data first then attempting to solve the problem?
Would it be fine to use the data points as is and calculate the mean squared error of each curve assuming it is a fitting attempt and thus getting the best fit? What would be the effect of the bias/shift in this case?
What is a good approach to dealing with the data/range mismatch, by decreasing the number of samples for the higher sampled version or increasing the number of samples for the lower sampled data in the given range?

Computing similarity matrix with mixed data

I have asked this question also on "Cross Validated" forum, but with no answer so far, so I am trying also here:
I would like to compute similarity matrix (which I will further use for clustering purposes) from my data (failure data from automotive company). The data consist of these variables:
START DATE + TIME (dd/mm/yyyy hh/mm/ss), DURATION (in seconds), DAY OF THE WEEK (mon,tue,...), WORKING TEAM (1,2,3), LOCALIZATION (1,2,3,...,20), FAILURE TYPE
From this, it is clear, that there are continuous and categorical data. What method would you suggest to calculate similarities between failure types? I think I can not use Euclidean distance, or Gowe's similarity. Thank you in advance.
No, you need an ad hoc function that represents your knowledge about what the data means in the real world. Presumably it will be mainly applying a weight to a continuous difference, and a 2D simple matrix for the discrete categorical variables. But don't rule our censoring of extreme values or fuzzification.

Best way to validate DBSCAN Clusters

I have used the ELKI implementation of DBSCAN to identify fire hot spot clusters from a fire data set and the results look quite good. The data set is spatial and the clusters are based on latitude, longitude. Basically, the DBSCAN parameters identify hot spot regions where there is a high concentration of fire points (defined by density). These are the fire hot spot regions.
My question is, after experimenting with several different parameters and finding a pair that gives a reasonable clustering result, how does one validate the clusters?
Is there a suitable formal validation method for my use case? Or is this subjective depending on the application domain?
ELKI contains a number of evaluation functions for clusterings.
Use the -evaluator parameter to enable them, from the evaluation.clustering.internal package.
Some of them will not automatically run because they have quadratic runtime cost - probably more than your clustering algorithm.
I do not trust these measures. They are designed for particular clustering algorithms; and are mostly useful for deciding the k parameter of k-means; not much more than that. If you blindly go by these measures, you end up with useless results most of the time. Also, these measures do not work with noise, with either of the strategies we tried.
The cheapest are the label-based evaluators. These will automatically run, but apparently your data does not have labels (or they are numeric, in which case you need to set the -parser.labelindex parameter accordingly). Personally, I prefer the Adjusted Rand Index to compare the similarity of two clusterings. All of these indexes are sensitive to noise so they don't work too well with DBSCAN, unless your reference has the same concept of noise as DBSCAN.
If you can afford it, a "subjective" evaluation is always best.
You want to solve a problem, not a number. That is the whole point of "data science", being problem oriented and solving the problem, not obsessed with minimizing some random quality number. If the results don't work in reality, you failed.
There are different methods to validate a DBSCAN clustering output. Generally we can distinguish between internal and external indices, depending if you have labeled data available or not. For DBSCAN there is a great internal validation indice called DBCV.
External Indices:
If you have some labeled data, external indices are great and can demonstrate how well the cluster did vs. the labeled data. One example indice is the RAND indice.https://en.wikipedia.org/wiki/Rand_index
Internal Indices:
If you don't have labeled data, then internal indices can be used to give the clustering result a score. In general the indices calculate the distance of points within the cluster and to other clusters and try to give you a score based on the compactness (how close are the points to each other in a cluster?) and
separability (how much distance is between the clusters?).
For DBSCAN, there is one great internal validation indice called DBCV by Moulavi et al. Paper is available here: https://epubs.siam.org/doi/pdf/10.1137/1.9781611973440.96
Python package: https://github.com/christopherjenness/DBCV

KNN classification with categorical data

I'm busy working on a project involving k-nearest neighbor (KNN) classification. I have mixed numerical and categorical fields. The categorical values are ordinal (e.g. bank name, account type). Numerical types are, for e.g. salary and age. There are also some binary types (e.g., male, female).
How do I go about incorporating categorical values into the KNN analysis?
As far as I'm aware, one cannot simply map each categorical field to number keys (e.g. bank 1 = 1; bank 2 = 2, etc.), so I need a better approach for using the categorical fields. I have heard that one can use binary numbers. Is this a feasible method?
You need to find a distance function that works for your data. The use of binary indicator variables solves this problem implicitly. This has the benefit of allowing you to continue your probably matrix based implementation with this kind of data, but a much simpler way - and appropriate for most distance based methods - is to just use a modified distance function.
There is an infinite number of such combinations. You need to experiment which works best for you. Essentially, you might want to use some classic metric on the numeric values (usually with normalization applied; but it may make sense to also move this normalization into the distance function), plus a distance on the other attributes, scaled appropriately.
In most real application domains of distance based algorithms, this is the most difficult part, optimizing your domain specific distance function. You can see this as part of preprocessing: defining similarity.
There is much more than just Euclidean distance. There are various set theoretic measures which may be much more appropriate in your case. For example, Tanimoto coefficient, Jaccard similarity, Dice's coefficient and so on. Cosine might be an option, too.
There are whole conferences dedicated to the topics of similarity search - nobody claimed this is trivial in anything but Euclidean vector spaces (and actually, not even there): http://www.sisap.org/2012
The most straight forward way to convert categorical data into numeric is by using indicator vectors. See the reference I posted at my previous comment.
Can we use Locality Sensitive Hashing (LSH) + edit distance and assume that every bin represents a different category? I understand that categorical data does not show any order and the bins in LSH are arranged according to a hash function. Finding the hash function that gives a meaningful number of bins sounds to me like learning a metric space.

How can I perform K-means clustering on time series data?

How can I do K-means clustering of time series data?
I understand how this works when the input data is a set of points, but I don't know how to cluster a time series with 1XM, where M is the data length. In particular, I'm not sure how to update the mean of the cluster for time series data.
I have a set of labelled time series, and I want to use the K-means algorithm to check whether I will get back a similar label or not. My X matrix will be N X M, where N is number of time series and M is data length as mentioned above.
Does anyone know how to do this? For example, how could I modify this k-means MATLAB code so that it would work for time series data? Also, I would like to be able to use different distance metrics besides Euclidean distance.
To better illustrate my doubts, here is the code I modified for time series data:
% Check if second input is centroids
if ~isscalar(k)
c=k;
k=size(c,1);
else
c=X(ceil(rand(k,1)*n),:); % assign centroid randomly at start
end
% allocating variables
g0=ones(n,1);
gIdx=zeros(n,1);
D=zeros(n,k);
% Main loop converge if previous partition is the same as current
while any(g0~=gIdx)
% disp(sum(g0~=gIdx))
g0=gIdx;
% Loop for each centroid
for t=1:k
% d=zeros(n,1);
% Loop for each dimension
for s=1:n
D(s,t) = sqrt(sum((X(s,:)-c(t,:)).^2));
end
end
% Partition data to closest centroids
[z,gIdx]=min(D,[],2);
% Update centroids using means of partitions
for t=1:k
% Is this how we calculate new mean of the time series?
c(t,:)=mean(X(gIdx==t,:));
end
end
Time series are usually high-dimensional. And you need specialized distance function to compare them for similarity. Plus, there might be outliers.
k-means is designed for low-dimensional spaces with a (meaningful) euclidean distance. It is not very robust towards outliers, as it puts squared weight on them.
Doesn't sound like a good idea to me to use k-means on time series data. Try looking into more modern, robust clustering algorithms. Many will allow you to use arbitrary distance functions, including time series distances such as DTW.
It's probably too late for an answer, but:
k-means can be used to cluster longitudinal data
Anony-Mousse is right, DWT distance is the way to go for time series
The methods above use R. You'll find more methods by looking, e.g., for "Iterative Incremental Clustering of Time Series".
I have recently come across the kml R package which claims to implement k-means clustering for longitudinal data. I have not tried it out myself.
Also the Time-series clustering - A decade review paper by S. Aghabozorgi, A. S. Shirkhorshidi and T. Ying Wah might be useful to you to seek out alternatives. Another nice paper although somewhat dated is Clustering of time series data-a survey by T. Warren Liao.
If you did really want to use clustering, then dependent on your application you could generate a low dimensional feature vector for each time series. For example, use time series mean, standard deviation, dominant frequency from a Fourier transform etc. This would be suitable for use with k-means, but whether it would give you useful results is dependent on your specific application and the content of your time series.
I don't think k-means is the right way for it either. As #Anony-Mousse suggested you can utilize DTW. In fact, I had the same problem for one of my projects and I wrote my own class for that in Python. The logic is;
Create your all cluster combinations. k is for cluster count and n is for number of series. The number of items returned should be n! / k! / (n-k)!. These would be something like potential centers.
For each series, calculate distances for each center in each cluster groups and assign it to the minimum one.
For each cluster groups, calculate total distance within individual clusters.
Choose the minimum.
And, the Python implementation is here if you're interested.