Evaluation metrics for KMeans clustering in PySpark ML library - pyspark

Can anyone please share evaluation metrics for KMeans clustering in the PySpark ML library, other than the Silhouette or SSE scores, which I have already calculated?
I found a couple of metrics, for example the Calinski-Harabasz index, but they are available in Python's scikit-learn library, and I am working in PySpark.
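One workaround, sketched below, is to compute the Calinski-Harabasz index yourself from the prediction DataFrame, since it only needs the cluster sizes, the cluster centers, and two sums of squared distances. This is a minimal sketch, not an official MLlib metric; it assumes the default pyspark.ml column names "features" and "prediction" and a fitted KMeansModel called model.

```python
import numpy as np

def calinski_harabasz(predictions, model,
                      features_col="features", prediction_col="prediction"):
    """Calinski-Harabasz index for a pyspark.ml KMeans result (sketch)."""
    n = predictions.count()
    centers = [np.asarray(c) for c in model.clusterCenters()]
    k = len(centers)

    # Cluster sizes from the prediction column
    sizes = {row[prediction_col]: row["count"]
             for row in predictions.groupBy(prediction_col).count().collect()}

    # Overall mean of the features (one extra pass over the data)
    feats = predictions.rdd.map(lambda row: row[features_col].toArray())
    overall = feats.reduce(lambda a, b: a + b) / float(n)

    # Between-cluster dispersion: size-weighted squared distances of the
    # cluster centers from the overall mean
    bgss = sum(sizes.get(i, 0) * float(np.sum((centers[i] - overall) ** 2))
               for i in range(k))

    # Within-cluster dispersion: squared distance of each point to its
    # assigned center (the same quantity as the WSSSE / training cost)
    centers_b = predictions.rdd.context.broadcast(centers)
    wgss = (predictions.rdd
            .map(lambda row: float(np.sum(
                (row[features_col].toArray()
                 - centers_b.value[row[prediction_col]]) ** 2)))
            .sum())

    return (bgss / (k - 1)) / (wgss / (n - k))
```

If the clustered data is small enough to collect, a simpler cross-check is predictions.select("features", "prediction").toPandas() followed by sklearn.metrics.calinski_harabasz_score on the driver.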

Related

Is it possible to initialize centers with specific values for spark kmeans?

I am using k-means from sklearn and from pyspark.ml.
The Spark version is much faster. However, it doesn't seem to have an option that I need. With sklearn's KMeans I can specify initial values for the cluster centers: KMeans(init=centers,...).
I don't see such an option in PySpark. Am I missing it, or am I out of luck and it doesn't exist?
Thank you
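As far as I can tell, the DataFrame-based pyspark.ml estimator only exposes the initialization mode, not user-supplied centers. The older RDD-based pyspark.mllib API accepts an initialModel built from explicit centers in recent Spark versions; treat that parameter as an assumption and check the docs for your version. A minimal sketch of the three variants (X and rdd are assumed to already hold your data as a NumPy array and an RDD of vectors):

```python
import numpy as np
from sklearn.cluster import KMeans as SkKMeans
from pyspark.ml.clustering import KMeans as MLKMeans
from pyspark.mllib.clustering import KMeans as MLlibKMeans, KMeansModel

centers = np.array([[0.0, 0.0], [5.0, 5.0]])

# scikit-learn: explicit initial centers
sk_model = SkKMeans(n_clusters=2, init=centers, n_init=1).fit(X)

# pyspark.ml: only the initialization mode ("k-means||" or "random") can be set
ml_kmeans = MLKMeans(k=2, initMode="k-means||", seed=1)

# pyspark.mllib (RDD API): seed training with a model built around the desired
# centers; the initialization mode is ignored when initialModel is given
initial = KMeansModel([c for c in centers])
mllib_model = MLlibKMeans.train(rdd, 2, maxIterations=20, initialModel=initial)
```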

How to find silhouette in k-means clustering while doing it in Scala/Spark 2.2.0

I have been working on clustering a dataset in Scala using Spark 2.2.0. Now that I have made the clusters, I want to test/evaluate their quality. Although I have been able to compute the sum of squared errors for each value of K, I was hoping to do a silhouette test. Could anyone please help by sharing any relevant functions or packages for doing so in Scala?
Silhouette is not scalable: it uses pairwise distances, which will always take O(n^2) time to compute.
Have you considered using the Within Set Sum of Squared Errors already implemented in MLlib (http://spark.apache.org/docs/latest/ml-clustering.html#k-means), which can also help determine the number of clusters? (See also: Cluster analysis in R: determine the optimal number of clusters.)
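For reference, the WSSSE / elbow approach looks roughly like this in the DataFrame API (sketched in PySpark for consistency with the rest of this page; the same computeCost call exists on the Scala KMeansModel). It assumes a DataFrame named dataset with a vector column "features". Note that Spark 2.3 later added a ClusteringEvaluator with a silhouette metric, but on 2.2.0 WSSSE is the readily available option.

```python
from pyspark.ml.clustering import KMeans

# Within Set Sum of Squared Errors (WSSSE) for a range of k values;
# computeCost is available on Spark 2.x (deprecated in later versions in
# favour of model.summary.trainingCost)
costs = {}
for k in range(2, 11):
    model = KMeans(k=k, seed=1, featuresCol="features").fit(dataset)
    costs[k] = model.computeCost(dataset)

for k, wssse in sorted(costs.items()):
    print(k, wssse)
# Choose k around the "elbow", where the drop in WSSSE starts to level off.
```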

Clustering data with categorical and numeric features in Apache Spark

I am currently looking for an algorithm in Apache Spark (Scala/Java) that is able to cluster data that has numeric and categorical features.
As far as I have seen, there are implementations of k-medoids and k-prototypes for PySpark (https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes), but I could not identify something similar for the Scala/Java version I am currently working with.
Is there another recommended algorithm to achieve similar things for Spark running Scala? Or am I overlooking something and could I actually make use of the PySpark library in my Scala project?
If you need further information or clarification, feel free to ask.
I think you first need to convert your categorical variables to numbers using OneHotEncoder; then you can apply your clustering algorithm from MLlib (e.g. k-means). I also recommend scaling or normalizing the features before clustering, since k-means is distance-sensitive.
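A sketch of that pipeline in PySpark (the same stages exist in the Scala API). The column names cat_col, num1 and num2 are placeholders for your own schema, and the multi-column OneHotEncoder shown here is the Spark 3.x form; on 2.3/2.4 the equivalent class is OneHotEncoderEstimator, and older 2.x releases have a single-column OneHotEncoder.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans

indexer = StringIndexer(inputCol="cat_col", outputCol="cat_idx")
encoder = OneHotEncoder(inputCols=["cat_idx"], outputCols=["cat_vec"])
assembler = VectorAssembler(inputCols=["cat_vec", "num1", "num2"],
                            outputCol="raw_features")
# k-means is distance based, so bring the features onto a comparable scale
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
kmeans = KMeans(k=5, featuresCol="features", seed=1)

pipeline = Pipeline(stages=[indexer, encoder, assembler, scaler, kmeans])
model = pipeline.fit(df)            # df: DataFrame with mixed-type columns
clustered = model.transform(df)     # adds a "prediction" column
```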

Visualization of spark machine learning in scala

I have developed an application performing logistic regression using Spark MLlib. How can we visually perceive the results? I mean, in R programming we can see the results graphically. Is there any way to visualize the results in a Scala Spark program as well?
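Spark itself has no plotting facilities, so a common pattern is to collect small summary DataFrames to the driver and plot them with a local library. A hedged sketch using the binary logistic regression training summary (shown in PySpark with matplotlib; from Scala you would typically write summary.roc out and plot it in R or Python), assuming a fitted LogisticRegressionModel named lr_model:

```python
import matplotlib.pyplot as plt

summary = lr_model.summary            # BinaryLogisticRegressionTrainingSummary
roc = summary.roc.toPandas()          # small DataFrame with FPR/TPR columns

plt.plot(roc["FPR"], roc["TPR"])
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC curve (AUC = %.3f)" % summary.areaUnderROC)
plt.show()
```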

How to do distributed Principal Components Analysis + Kmeans using Apache Spark?

I need to run Principal Components Analysis and K-means clustering on a large-ish dataset (around 10 GB) which is spread out over many files. I want to use Apache Spark for this since it's known to be fast and distributed.
I know that Spark supports PCA and also PCA + Kmeans.
However, I haven't found an example which demonstrates how to do this with many files in a distributed manner.
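A minimal sketch of what such a job could look like, assuming the files are CSVs of numeric columns under one directory; the path, schema handling, k, and the number of principal components are all placeholders. Spark reads every matching file into a single distributed DataFrame, so PCA and k-means then run over the full dataset without any manual per-file handling.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, PCA
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("pca-kmeans").getOrCreate()

# A glob or directory path loads every file into one distributed DataFrame
df = spark.read.csv("hdfs:///data/measurements/*.csv",
                    header=True, inferSchema=True)

assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
pca = PCA(k=10, inputCol="features", outputCol="pca_features")
kmeans = KMeans(k=8, featuresCol="pca_features", seed=1)

model = Pipeline(stages=[assembler, pca, kmeans]).fit(df)
clusters = model.transform(df).select("pca_features", "prediction")
```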