Data normalization for K-Means algorithm - cluster-analysis

I want to cluster my data using the K-Means algorithm, and for this my data should be normalized. I don't know which normalization method is better for this algorithm (min-max, z-transformation, decimal scaling, ...). RapidMiner normalizes data with the z-transformation method, but how can I implement min-max normalization in RapidMiner? Or which tools and methods are better for normalizing data? And how can I check whether my data needs normalization at all?

The proper way of normalization depends on your data.
As a rule of thumb:
If all axes measure the same thing, normalization is probably harmful.
If axes have different units and very different scale, normalization is absolutely necessary (otherwise, you are comparing apples and oranges).
If you know or assume that certain attributes are more important than others, consider manual weighting of attributes.
As for min-max or z-transformation: this depends on the distribution of the data. If you have outliers, min-max does not work well.
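If you decide min-max scaling is appropriate, it is also easy to apply outside RapidMiner. Here is a minimal Python/NumPy sketch (the data matrix is made up) showing both min-max and z-transformation side by side for comparison:

```python
import numpy as np

# made-up data matrix: rows are observations, columns are attributes
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 900.0]])

# min-max normalization: rescale each column to [0, 1]; sensitive to outliers
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# z-transformation: zero mean, unit variance per column
X_z = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_z)
```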

Related

How to do Hierarchical Heteroskedastic Sparse GPs in GPflow?

Is it possible to model a general trend from a population using GPflow and also have individual predictions, as in Hensman et al.?
Specifically, I am trying to fit spatial data from a number of individuals in a clinical assessment. For each individual, I am dealing with approximately 20,000 data points (a different number of recordings for each individual), which definitely restricts me to a sparse implementation. In addition, it also seems that I need an input-dependent noise model, hence the heteroskedasticity.
I have fitted a hetero-sparse model as in this notebook example, but I am not sure how to scale it to perform the hierarchical learning. Any ideas would be welcome :)
https://github.com/mattramos/SparseHGP may be helpful. This repo gives GPflow 2 code for modelling a sparse hierarchical model. Note that there are still some rough edges in the implementation that require an expensive for loop to be constructed.
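For reference, a rough sketch of the non-hierarchical starting point: a heteroskedastic sparse SVGP roughly along the lines of the GPflow heteroskedastic notebook mentioned above. X, Y, and the inducing-point count are placeholders, and the training loop here uses plain Adam rather than the natural-gradient setup of the notebook:

```python
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
import gpflow

# toy stand-ins for one individual's recordings (placeholders, not real data)
N = 500
X = np.random.rand(N, 2)                                   # spatial inputs
Y = np.sin(6 * X[:, :1]) + 0.1 * np.random.randn(N, 1)     # responses

# two latent GPs: one models the mean, one the (log) noise scale
kernel = gpflow.kernels.SeparateIndependent(
    [gpflow.kernels.SquaredExponential(), gpflow.kernels.SquaredExponential()]
)
likelihood = gpflow.likelihoods.HeteroskedasticTFPConditional(
    distribution_class=tfp.distributions.Normal,
    scale_transform=tfp.bijectors.Exp(),
)

# sparse approximation with M inducing points per latent GP
M = 100
Z = X[np.random.choice(N, M, replace=False)]
iv = gpflow.inducing_variables.SeparateIndependentInducingVariables(
    [gpflow.inducing_variables.InducingPoints(Z.copy()),
     gpflow.inducing_variables.InducingPoints(Z.copy())]
)

model = gpflow.models.SVGP(
    kernel=kernel,
    likelihood=likelihood,
    inducing_variable=iv,
    num_latent_gps=likelihood.latent_dim,
)

# simple Adam training loop over all parameters
loss = model.training_loss_closure((X, Y))
opt = tf.optimizers.Adam(0.01)
for _ in range(2000):
    opt.minimize(loss, model.trainable_variables)
```

Scaling this to the hierarchical setting (a shared population GP plus per-individual GPs) is exactly what the SparseHGP repo above attempts.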

MATLAB fitlm: OLS vs Robust regression

I am trying to calculate a linear regression of some data that I have using MATLAB's fitlm tool. Using ordinary least-squares (OLS) I get fairly low R-squared values (~ 0.2-0.5), and occasionally even unrealistic results. Whereas when using robust regression (specifically the 'talwar' option), I get much better results (R2 ~ 0.7-0.8).
I am no statistician, so my question is: Is there any reason I should not believe that the robust results are better?
Here is an example of some of the data. The data shown produces R2 of OLS: 0.56, robust:0.72.
One reason you're going to get notable differences in R-squared values is that the Talwar option handles outliers differently: it subdivides your data set into segments and computes averages for each of those segments.
Taken from the abstract of Talwar's paper:
'Estimates of the parameters of a linear model are usually obtained by the method of ordinary least-squares (OLS), which is sensitive to large values of the additive error term... we obtain a simple, consistent and asymptotically normal initial estimate of the coefficients, which protects the analyst from large values of εi which are often hard to detect using OLS on a model with many regressors.' (https://www.jstor.org/stable/2285386?seq=1#page_scan_tab_contents)
Whether Talwar or OLS is better depends on your knowledge of the measurement process (namely, how the outliers can be explained). If it is appropriate to prune the data with a Q-test to remove outliers (see http://education.mrsec.wisc.edu/research/topic_guides/outlier_handout.pdf), doing so should minimize the differences in R-squared you see between Talwar and OLS.
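For anyone who wants to see the effect outside MATLAB, here is a rough Python sketch using statsmodels, with synthetic data and the TrimmedMean norm as a Talwar-type estimator (the data, the outlier injection, and the cutoff c are made up for illustration):

```python
import numpy as np
import statsmodels.api as sm

# synthetic linear data with a few large outliers injected
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1, x.size)
y[::15] += rng.normal(0, 15, y[::15].size)      # occasional gross errors

X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()
# TrimmedMean gives zero weight to residuals beyond c * scale (Talwar-style)
rlm = sm.RLM(y, X, M=sm.robust.norms.TrimmedMean(c=2.0)).fit()

print("OLS    intercept/slope:", ols.params, " R^2:", ols.rsquared)
print("Talwar intercept/slope:", rlm.params)
```

On data like this the robust fit recovers the underlying slope much more closely, which mirrors the behaviour described above.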
Of course, yes. The idea of robust regression is very broad and there are different types of robust regression, so there are situations where one robust regression method performs better than the others.

What is the importance of clustering?

During unsupervised learning we do cluster analysis (like K-Means) to bin the data to a number of clusters.
But what is the use of this clustered data in a practical scenario?
I think during clustering we are losing information about the data.
Are there some practical examples where clustering could be beneficial?
The information loss can be intentional. Here are some examples:
PCM signal quantization (Lloyd's k-means publication). You know that a certain number of different signals (say 10) are transmitted, but with distortion. Quantizing removes the distortion and re-extracts the original 10 different signals. Here, you lose the error and keep the signal.
Color quantization (see Wikipedia). To reduce the number of colors in an image, a quite nice method uses k-means (usually in HSV or Lab space); k is the number of desired output colors. The information loss here is intentional, to better compress the image: k-means attempts to find the least-squared-error approximation of the image with just k colors (see the sketch after this list).
When searching for motifs in time series, you can also use quantization such as k-means to transform your data into a symbolic representation. The bag-of-visual-words approach that was the state of the art for image recognition prior to deep learning also used this.
Explorative data mining (clustering; one may argue that the above use cases are not data mining / clustering but quantization). If you have a data set of a million points, which points are you going to investigate? Clustering methods try to split the data into groups that are supposed to be more homogeneous within and more different from one another. Then you don't have to look at every object, but only at some from each cluster, to hopefully learn something about the whole cluster (and your whole data set). Centroid methods such as k-means can even provide a "prototype" for each cluster, although it is a good idea to also look at other points within the cluster. You may also want to do outlier detection and look at some of the unusual objects. This scenario is somewhere in between sampling representative objects and reducing the data set size to become more manageable. The key difference to the above points is that the result is usually not "operationalized" automatically, but needs to be analyzed manually, because explorative clustering results are too unreliable (and thus require many iterations).
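To illustrate the color-quantization example above, here is a minimal sketch using scikit-learn and Pillow (the file names are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans
from PIL import Image

# load the image and reshape it into a list of RGB pixels
img = np.asarray(Image.open("photo.jpg").convert("RGB"), dtype=np.float64) / 255.0
h, w, _ = img.shape
pixels = img.reshape(-1, 3)

# cluster the pixel colours; k is the number of colours kept in the output
k = 16
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)

# replace every pixel with its cluster centroid (the palette colour)
quantized = km.cluster_centers_[km.labels_].reshape(h, w, 3)
Image.fromarray((quantized * 255).astype(np.uint8)).save("photo_16colours.png")
```

The information that is lost is exactly the within-cluster colour variation, which is the point of the exercise.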

Clustering Algorithm for average energy measurements

I have a data set which consists of data points having attributes like:
average daily consumption of energy
average daily generation of energy
type of energy source
average daily energy fed in to grid
daily energy tariff
I am new to clustering techniques.
So my question is: which clustering algorithm will be best for this kind of data?
I think hierarchical clustering is a good choice. Have a look here Clustering Algorithms
The simplest way to do clustering is with the k-means algorithm. If all of your attributes are numerical, this is the easiest way to do the clustering. Even if they are not, you would just have to find a distance measure for categorical or nominal attributes, and k-means would still be a good choice. K-means is a partitional clustering algorithm; I wouldn't use hierarchical clustering in this case. But that also depends on what you want to do: you need to decide whether you want to find clusters within clusters, or whether all clusters have to be completely separate from each other rather than nested within one another.
Take care.
1) First, try k-means. If that fulfills your needs, that's it. Play with different numbers of clusters (controlled by the parameter k); see the sketch after this list. There are a number of implementations of k-means, and you can implement your own version if you have good programming skills.
K-means generally works well if the clusters have a circular/spherical shape, which means there is some Gaussianity in the data (the data comes from a Gaussian distribution).
2) If k-means doesn't fulfill your expectations, it is time to read and think more. I then suggest reading a good survey paper. The most common techniques are implemented in several programming languages and data mining frameworks, many of them free to download and use.
3) If applying state-of-the-art clustering techniques is not enough, it is time to design a new technique. Then you can think it through yourself or team up with a machine learning expert.
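As a sketch of point 1), here is how you might try k-means with several values of k using scikit-learn. The feature matrix is a random stand-in for the listed attributes, and scaling first is assumed so that no attribute dominates the Euclidean distance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# hypothetical feature matrix: one row per customer/site, columns e.g.
# [avg daily consumption, avg daily generation, avg daily feed-in, daily tariff]
rng = np.random.default_rng(0)
X = rng.random((200, 4))                    # stand-in for real measurements

Xs = StandardScaler().fit_transform(X)      # put attributes on comparable scales

# try several k and compare silhouette scores (higher is better)
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xs)
    print(k, silhouette_score(Xs, labels))
```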
Since most of your data is continuous, and it is reasonable to assume that energy consumption and generation are normally distributed, I would use statistical methods for clustering.
Such as:
Gaussian Mixture Models
Bayesian Hierarchical Clustering
The advantage of these methods over metric-based clustering algorithms (e.g. k-means) is that we can take advantage of the fact that we are dealing with averages, and we can make assumptions about the distributions from which those averages were calculated.
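A minimal sketch of the Gaussian mixture suggestion with scikit-learn, assuming the attributes have been collected into a numeric matrix (the data below is a random stand-in):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture

# hypothetical numeric features per site, as in the question
rng = np.random.default_rng(1)
X = rng.random((200, 4))                      # stand-in for real averages
Xs = StandardScaler().fit_transform(X)

gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
labels = gmm.fit_predict(Xs)                  # hard cluster assignments
probs = gmm.predict_proba(Xs)                 # soft memberships, unlike k-means
print(gmm.bic(Xs))                            # BIC helps compare different n_components
```

The soft memberships and the ability to compare models via BIC are the practical gains over plain k-means here.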

Resampling data with minimal loss of information in time-domain

I am trying to resample/recreate already recorded data for plotting purposes. I thought this is best place to ask the question (besides dsp.se).
The data is sampled at a high frequency, contains too many data points, and is not suitable for plotting in the time domain (not enough memory). I want to resample it with minimal loss. The sampling interval of the resulting data doesn't need to be uniform (again, this is for plotting purposes, not analysis), although the input data is equally sampled.
When we use the regular resample command from MATLAB/Octave, it can distort stiff pieces of the curve.
What is the best approach here?
For reference, I have included two pictures (found on tex.se):
The first image is the regular resample.
The second image shows better resampled data that behaves well around the peaks.
You should try this set of files from the File Exchange. It computes an optimal lookup table based on either a maximum number of points or a given error. You can choose natural, linear, or spline interpolation. Spline will give the smallest table size but is slower than linear. I don't use natural unless I have a really good reason.
Sincerely,
Jason
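One simple peak-preserving alternative to plain resampling is min/max decimation: keep the minimum and maximum of each bucket so narrow peaks survive the reduction. A small Python sketch of the idea (the signal is synthetic):

```python
import numpy as np

def minmax_downsample(t, y, n_buckets):
    """Reduce (t, y) to roughly 2 * n_buckets points by keeping the minimum
    and maximum sample of each bucket, so sharp peaks are not lost."""
    edges = np.linspace(0, len(y), n_buckets + 1).astype(int)
    idx = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if hi <= lo:
            continue
        seg = y[lo:hi]
        i_min, i_max = lo + np.argmin(seg), lo + np.argmax(seg)
        idx.extend(sorted((i_min, i_max)))
    idx = np.array(idx)
    return t[idx], y[idx]

# example: 1 million samples reduced to ~4000 points for plotting
t = np.linspace(0, 10, 1_000_000)
y = np.sin(2 * np.pi * 5 * t) + 0.01 * np.random.randn(t.size)
td, yd = minmax_downsample(t, y, 2000)
```

The output is no longer uniformly sampled, which is fine for plotting, as noted in the question.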