Can I use DBSCAN on non-numeric data? - cluster-analysis

I want to do clustering in Java with a DBSCAN implementation.
Can DBSCAN be applied to non-numeric data (pairs of numbers)?
If yes, how?

All you need is a distance function.
In fact, with GDBSCAN you only need a binary predicate isNeighbor(x, y).
Jörg Sander, Martin Ester, Hans-Peter Kriegel, Xiaowei Xu: Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications. In: Data Mining and Knowledge Discovery, Vol. 2, No. 2. Springer, 1998. doi:10.1023/A:1009745219419
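As an illustration, here is a minimal Java sketch of a neighborhood query built on an arbitrary isNeighbor predicate rather than a numeric distance. The class and method names (NeighborQuery, neighborsOf) are made up for this example, not taken from any library:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical sketch: a GDBSCAN-style neighborhood query that relies
// only on a binary predicate isNeighbor(x, y), not on numeric data.
public class NeighborQuery<T> {
    private final List<T> data;
    private final BiPredicate<T, T> isNeighbor;

    public NeighborQuery(List<T> data, BiPredicate<T, T> isNeighbor) {
        this.data = data;
        this.isNeighbor = isNeighbor;
    }

    // All points considered neighbors of p. DBSCAN's core-point test
    // then only has to compare the size of this list to minPts.
    public List<T> neighborsOf(T p) {
        List<T> result = new ArrayList<>();
        for (T q : data) {
            if (isNeighbor.test(p, q)) {
                result.add(q);
            }
        }
        return result;
    }
}
```

For pairs of numbers you could pass a predicate such as (a, b) -> euclideanDistance(a, b) <= eps; for genuinely non-numeric objects, any domain-specific notion of "neighbor" will do, which is exactly the point of GDBSCAN.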

Related

Spatial features and pattern analysis of a plane?

I am working on instances from TSPLIB, which are simply coordinates of nodes in a plane. I'm looking to analyze spatial characteristics of a set of instances (e.g. clustered, not clustered, dispersed) and I would like to implement some code in Matlab to compute specific features.
For example, so far I have used nearest-neighbor analysis to identify clusters, as well as quadrat analysis. Can anyone suggest any other spatial features and patterns that could be computed with relatively simple code? Perhaps someone here is an expert in the Traveling Salesman Problem. Thank you so much!
K-means is a very useful clustering tool that you can use:
https://www.mathworks.com/help/stats/kmeans.html
Nearest neighbor is a classification method. If you want to do classification, you can use k-nearest neighbors, SVMs, or the Neural Network Pattern Recognition toolbox; these are all already in Matlab.
Also, check out the Matlab Apps; there are some very good clustering tools available there as well, with examples.
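The question asks about Matlab, but if you ever want the same thing in plain Java (e.g. alongside a TSP solver), a library such as Apache Commons Math ships a k-means++ clusterer. A minimal sketch; the sample coordinates and the choice of k = 2 are arbitrary:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.math3.ml.clustering.CentroidCluster;
import org.apache.commons.math3.ml.clustering.DoublePoint;
import org.apache.commons.math3.ml.clustering.KMeansPlusPlusClusterer;

public class TspKMeans {
    public static void main(String[] args) {
        // TSPLIB-style node coordinates (made-up sample values).
        double[][] coords = { {0, 0}, {1, 1}, {0, 1}, {10, 10}, {11, 10}, {10, 11} };
        List<DoublePoint> points = new ArrayList<>();
        for (double[] c : coords) {
            points.add(new DoublePoint(c));
        }

        // k-means++ with k = 2 clusters (k chosen arbitrarily here).
        KMeansPlusPlusClusterer<DoublePoint> kmeans = new KMeansPlusPlusClusterer<>(2);
        List<CentroidCluster<DoublePoint>> clusters = kmeans.cluster(points);

        for (CentroidCluster<DoublePoint> cluster : clusters) {
            System.out.println("center: " + cluster.getCenter()
                    + ", size: " + cluster.getPoints().size());
        }
    }
}
```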

How to calculate the Davies-Bouldin index for clustering methods in RapidMiner?

I want to cluster data without k-means; for example, I prefer to cluster with DBSCAN or support vector clustering.
So I need to evaluate the clustering performance with the Davies-Bouldin metric, but I don't know how to calculate Davies-Bouldin in RapidMiner for DBSCAN or support vector clustering.
Please help me.
Thank you.
The operator Cluster Distance Performance allows the Davies-Bouldin validity measure to be calculated. It requires a cluster model containing the cluster centroids to be passed to it, which means approaches like DBSCAN and support vector clustering cannot be used with it, because they do not produce cluster centroids.
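The index itself is easy to compute by hand once you define a representative per cluster. A minimal Java sketch, assuming Euclidean distance and one centroid per cluster; the class and method names are illustrative, not a RapidMiner API:

```java
// Minimal sketch of the Davies-Bouldin index for k clusters,
// given the points of each cluster and its centroid (Euclidean distance).
public class DaviesBouldin {

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // clusters[i] = points of cluster i; centroids[i] = its centroid.
    static double index(double[][][] clusters, double[][] centroids) {
        int k = clusters.length;
        double[] scatter = new double[k]; // average within-cluster distance
        for (int i = 0; i < k; i++) {
            double sum = 0;
            for (double[] p : clusters[i]) {
                sum += distance(p, centroids[i]);
            }
            scatter[i] = sum / clusters[i].length;
        }
        double db = 0;
        for (int i = 0; i < k; i++) {
            double worst = 0;
            for (int j = 0; j < k; j++) {
                if (i == j) continue;
                double r = (scatter[i] + scatter[j]) / distance(centroids[i], centroids[j]);
                worst = Math.max(worst, r);
            }
            db += worst; // each cluster contributes its worst similarity ratio
        }
        return db / k; // lower is better
    }
}
```

For DBSCAN or support vector clustering output you would first have to pick a representative yourself, e.g. the mean of each cluster's members; that is exactly the step the RapidMiner operator cannot do for you.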

One-Hot-Encoding for Wide&Deep-Model and normalization by using DNNLinearCombinedClassifier

I'm starting to use TensorFlow and have read the whitepaper Wide & Deep Learning for Recommender Systems by Cheng et al. Now I have questions in two areas about using it with TensorFlow the way it's done in the tutorial.
If I use categorical columns like tf.contrib.layers.sparse_column_with_hash_bucket("education", hash_bucket_size=1000): does the model use a one-hot encoding for this feature under the hood? Or can I apply a one-hot encoding myself, and are there advantages to this hashing over one-hot encoding? And what about the hash: do I have to set hash_bucket_size to a value as high as the number of distinct values of this feature?
The second question is about normalization of continuous real-valued features. Cheng et al. write that their "continuous real-valued features are normalized to [0,1]". If I use tf.contrib.learn.DNNLinearCombinedClassifier, will this be done automatically? Or is it up to me to implement this normalization?
Thanks for your support!
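Independent of TensorFlow internals, the hashing trick itself is easy to illustrate. A minimal Java sketch of hash-bucket encoding versus plain one-hot indexing; the vocabulary and bucket size are made up, and the bucket size is deliberately smaller than the vocabulary to force collisions, which is why hash_bucket_size is usually set well above the expected number of distinct values:

```java
import java.util.Arrays;

// Sketch: mapping a categorical string to a one-hot vector, with the
// index coming from the hashing trick instead of an explicit vocabulary.
public class HashBucketDemo {

    // Hashing trick: no vocabulary needed; distinct values may collide
    // if hashBucketSize is smaller than the number of distinct values.
    static int hashBucket(String value, int hashBucketSize) {
        return Math.floorMod(value.hashCode(), hashBucketSize);
    }

    static double[] oneHot(int index, int size) {
        double[] v = new double[size];
        v[index] = 1.0;
        return v;
    }

    public static void main(String[] args) {
        String[] education = { "Bachelors", "Masters", "Doctorate" };
        int bucketSize = 2; // deliberately too small: 3 values, 2 buckets
        for (String e : education) {
            int b = hashBucket(e, bucketSize);
            System.out.println(e + " -> bucket " + b + " -> "
                    + Arrays.toString(oneHot(b, bucketSize)));
        }
    }
}
```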

Clustering Algorithm for average energy measurements

I have a data set which consists of data points having attributes like:
average daily consumption of energy
average daily generation of energy
type of energy source
average daily energy fed in to grid
daily energy tariff
I am new to clustering techniques, so my question is: which clustering algorithm will be best suited to form clusters from this kind of data?
I think hierarchical clustering is a good choice. Have a look here: Clustering Algorithms
The simplest way to do clustering is the k-means algorithm. If all of your attributes are numerical, then this is the easiest way of doing the clustering. Even if they are not, you would only have to find a distance measure for the categorical or nominal attributes (see the sketch after this answer), and k-means would still be a good choice. K-means is a partitional clustering algorithm; I wouldn't use hierarchical clustering in this case. But that also depends on what you want to do: you need to decide whether you want to find clusters within clusters, or whether all clusters have to be completely separate from each other.
Take care.
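A minimal Gower-style sketch of such a mixed numeric/categorical distance, assuming the numeric attributes have already been scaled to [0, 1]; the attribute layout and sample records are hypothetical:

```java
// Sketch of a Gower-style distance for mixed numeric/categorical records:
// numeric attributes contribute |a - b| (assuming values scaled to [0, 1]),
// categorical attributes contribute 0 on a match and 1 otherwise.
public class MixedDistance {

    static double distance(double[] numA, double[] numB, String[] catA, String[] catB) {
        double sum = 0;
        int n = numA.length + catA.length;
        for (int i = 0; i < numA.length; i++) {
            sum += Math.abs(numA[i] - numB[i]); // assumes [0, 1] scaling
        }
        for (int i = 0; i < catA.length; i++) {
            sum += catA[i].equals(catB[i]) ? 0.0 : 1.0;
        }
        return sum / n; // average contribution over all attributes
    }

    public static void main(String[] args) {
        // Two energy records: {consumption, generation, fed-in, tariff}
        // plus the energy source type as a categorical attribute.
        double d = distance(new double[] {0.4, 0.1, 0.2, 0.5},
                            new double[] {0.5, 0.3, 0.2, 0.5},
                            new String[] {"solar"},
                            new String[] {"wind"});
        System.out.println("distance = " + d);
    }
}
```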
1) First, try k-means. If that fulfills your needs, that's it. Play with different numbers of clusters (controlled by the parameter k). There are a number of implementations of k-means, and you can implement your own version if you have good programming skills.
K-means generally works well if the data looks circular/spherical in shape. This means that there is some Gaussianity in the data (the data comes from a Gaussian distribution).
2) If k-means doesn't fulfill your expectations, it is time to read and think more. Then I suggest reading a good survey paper. The most common techniques are implemented in several programming languages and data mining frameworks, many of which are free to download and use.
3) If applying state-of-the-art clustering techniques is not enough, it is time to design a new technique. Then you can think it through yourself or team up with a machine learning expert.
Since most of your data is continuous, and it is reasonable to assume that energy consumption and generation are normally distributed, I would use statistical methods for clustering, such as:
Gaussian Mixture Models
Bayesian Hierarchical Clustering
The advantage of these methods over metric-based clustering algorithms (e.g. k-means) is that we can take advantage of the fact that we are dealing with averages, and we can make assumptions about the distributions from which those averages were calculated.
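As a concrete starting point for Gaussian mixtures, here is a minimal Java sketch using Apache Commons Math's EM fitter; the data values are made up, and in practice the rows would be your average consumption/generation measurements:

```java
import org.apache.commons.math3.distribution.MixtureMultivariateNormalDistribution;
import org.apache.commons.math3.distribution.fitting.MultivariateNormalMixtureExpectationMaximization;

public class EnergyGmm {
    public static void main(String[] args) {
        // Rows: {avg daily consumption, avg daily generation} -- made-up values.
        double[][] data = {
            {1.0, 0.20}, {1.1, 0.30}, {0.9, 0.25},
            {5.0, 4.00}, {5.2, 3.80}, {4.9, 4.10}
        };

        MultivariateNormalMixtureExpectationMaximization em =
                new MultivariateNormalMixtureExpectationMaximization(data);

        // Crude initial guess with 2 components; EM then refines it.
        MixtureMultivariateNormalDistribution initial =
                MultivariateNormalMixtureExpectationMaximization.estimate(data, 2);
        em.fit(initial);

        MixtureMultivariateNormalDistribution model = em.getFittedModel();
        System.out.println("log-likelihood: " + em.getLogLikelihood());
        model.getComponents().forEach(c ->
                System.out.println("component weight: " + c.getFirst()));
    }
}
```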

ELKI OPTICSXi - how to set xi?

I'm trying to use ELKI to cluster a dataset of geolocations using OPTICS. I've understood that to extract the clusters, I need to use the OPTICSXi algorithm rather than OPTICS, which computes just the cluster order.
I was wondering if you could give me more information on how the parameter xi works.
I fixed this value at 0.009, but chose it arbitrarily.
You can read up on the xi parameter in:
Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander (1999). OPTICS: Ordering Points To Identify the Clustering Structure. ACM SIGMOD International Conference on Management of Data. ACM Press, pp. 49–60.
It is a contrast parameter: the relative decrease in density. I usually try values such as 0.1 (= a 10% drop in density). However, the exact drop in density to be expected heavily depends on your data set and parameters, for obvious reasons.
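To make "relative decrease in density" concrete: in the OPTICS paper, position i in the reachability plot starts a xi-steep downward area when the reachability drops by at least a factor of xi from i to i+1. A minimal sketch of that test; the reachability values are made up:

```java
// Sketch of the basic xi test from the OPTICS paper: position i in the
// cluster-order (reachability) plot is "xi-steep downward" when the
// reachability drops by at least a factor of xi from i to i+1.
public class XiSteep {

    static boolean steepDown(double[] reachability, int i, double xi) {
        return reachability[i + 1] <= reachability[i] * (1.0 - xi);
    }

    public static void main(String[] args) {
        double[] reach = { 2.0, 1.7, 1.69, 0.5 }; // made-up reachability plot
        double xi = 0.1; // require at least a 10% drop
        for (int i = 0; i + 1 < reach.length; i++) {
            System.out.println(i + " -> " + (i + 1)
                    + ": steep down = " + steepDown(reach, i, xi));
        }
    }
}
```

With xi = 0.1, the 2.0 -> 1.7 and 1.69 -> 0.5 transitions count as steep, while 1.7 -> 1.69 does not; a larger xi demands sharper density contrasts and thus produces fewer, more pronounced clusters.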