Is it possible to initialize centers with specific values for Spark KMeans? - pyspark

I am using kmeans from sklearn and from pyspark.ml.
The Spark version is much faster. However, it doesn't seem to have an option that I need. With sklearn KMeans I can specify initial values for the cluster centers: KMeans(init=centers, ...).
I don't see such an option for pyspark. Am I missing it, or am I out of luck and it doesn't exist?
Thank you
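
To make the sklearn behaviour concrete, here is the init=centers usage described above, together with a hedged sketch of one possible workaround on the Spark side: the RDD-based pyspark.mllib KMeans accepts an initialModel built from explicit centers, while the DataFrame-based pyspark.ml KMeans only exposes initMode. Verify the initialModel parameter against your Spark version; the data and centers below are made up.

import numpy as np
from pyspark.sql import SparkSession
from pyspark.mllib.clustering import KMeans as MllibKMeans, KMeansModel
from sklearn.cluster import KMeans as SkKMeans

# Made-up toy data and initial centers.
X = np.array([[0.1, 0.2], [0.2, 0.1], [5.1, 4.9], [4.8, 5.2]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])

# sklearn: pass the centers directly as the init strategy.
sk_model = SkKMeans(n_clusters=2, init=centers, n_init=1).fit(X)

# pyspark.mllib (hedged): wrap the centers in a KMeansModel and pass it as initialModel.
spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize(X.tolist())
init_model = KMeansModel([np.asarray(c) for c in centers])
mllib_model = MllibKMeans.train(rdd, k=2, maxIterations=20, initialModel=init_model)
print(mllib_model.clusterCenters)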

Related

How to load a GraphFrame/PySpark DataFrame into a PyTorch Geometric (InMemory)Dataset?

Has anybody ever built a custom torch_geometric.data.InMemoryDataset for a Spark GraphFrame (or rather, PySpark DataFrames)? I looked for people who have done this already but didn't find anything on GitHub, Stack Overflow, et cetera, and I have little knowledge of PyTorch Geometric as of right now.
I'd be thankful for code samples, tips, or matching links :)
You cannot run a GCN on Spark as of now, so PyTorch Geometric doesn't support Spark-based training.
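
For small graphs that do fit on the driver, one workaround is to collect the GraphFrame's vertex and edge DataFrames and build a torch_geometric Data object by hand. A minimal sketch, assuming GraphFrames' usual id/src/dst columns and a hypothetical vertex feature column "feat":

import torch
from torch_geometric.data import Data

# g is a GraphFrame; everything is collected to the driver, so this only
# works when the graph fits in memory.
vertices = g.vertices.select("id", "feat").toPandas()
edges = g.edges.select("src", "dst").toPandas()

# Map arbitrary vertex ids to consecutive integer indices.
index = {vid: i for i, vid in enumerate(vertices["id"])}
edge_index = torch.tensor(
    [[index[s] for s in edges["src"]],
     [index[d] for d in edges["dst"]]],
    dtype=torch.long,
)
x = torch.tensor(vertices["feat"].tolist(), dtype=torch.float)

data = Data(x=x, edge_index=edge_index)  # can then back a custom InMemoryDataset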

Evaluation metrics for KMeans clustering in PySpark MLlib

Can anyone please share any evaluation metrics used for KMeans clustering in the PySpark ML library, apart from the Silhouette and SSE scores, which I have already calculated?
I found a couple of other metrics, such as the Calinski-Harabasz index, but they are available in Python's scikit-learn library, whereas I am working in PySpark.
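
As far as I know there is no built-in Calinski-Harabasz evaluator in PySpark, but the index is just the ratio of between-cluster to within-cluster dispersion, so it can be computed with a few aggregations. A hedged sketch, assuming a predictions DataFrame with a hypothetical Vector column "features" and an integer column "prediction" from a fitted KMeans model:

import numpy as np

pairs = predictions.select("prediction", "features").rdd \
    .map(lambda row: (row["prediction"], np.asarray(row["features"].toArray())))
pairs.cache()

n = pairs.count()
global_mean = pairs.map(lambda kv: kv[1]).sum() / n

# Per-cluster counts and means.
stats = pairs.map(lambda kv: (kv[0], (1, kv[1]))) \
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
    .mapValues(lambda cs: (cs[0], cs[1] / cs[0])) \
    .collectAsMap()
k = len(stats)

# Between-cluster dispersion: sum over clusters of n_j * ||mean_j - global_mean||^2
bss = sum(cnt * float(np.sum((mean - global_mean) ** 2)) for cnt, mean in stats.values())
# Within-cluster dispersion: sum over points of ||x_i - mean of its cluster||^2
wss = pairs.map(lambda kv: float(np.sum((kv[1] - stats[kv[0]][1]) ** 2))).sum()

calinski_harabasz = (bss / (k - 1)) / (wss / (n - k))
print(calinski_harabasz)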

K-means finds a singleton cluster when I standardize features (Wholesale customers dataset)

I am studying the Wholesale customers dataset. Running the elbow method I find that k=5 seems to be a good number of clusters. Unfortunately, when I standardize my features I get a singleton cluster, even with several inits. This does not happen when I don't standardize.
I know that standardization of the features is an often-asked question, however I still don't understand if that's good practice or not. Here I standardize because the variances of some features are quite different. If it's a bad idea here, can you please explain why?
Here is an example of an MDS visualisation of the K-means result. As you can see, at the bottom left of the picture there is a point which has its own cluster (it has a unique color). Is it because it's an outlier? Should I remove it by hand before running K-means?
Here is an MWE if you want to rerun the experiment yourself. Please don't hesitate to be straightforward if I somehow made a mistake.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.manifold import MDS

df = pd.read_csv("./wholesale-dataset.csv")
# Drop the first two (categorical) columns, Channel and Region, and standardize the rest.
X = StandardScaler().fit_transform(df.values[:, 2:])
km = KMeans(n_clusters=5)
km.fit(X)
# 2-D embedding of the standardized data for visualization.
mds = MDS().fit_transform(X)
fkm = plt.figure()
fkm.gca().scatter(mds[:, 0], mds[:, 1], c=km.labels_)
plt.show()
There is nothing wrong with k-means producing singleton clusters.
When you have outliers in your data, making such clusters likely improves the SSE objective of k-means, so this behavior is correct.
But judging from your plot, I'd argue that the correct k here is 1: there is one big blob, plus some outliers, but not multiple clusters.
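
If you still want to sanity-check whether that singleton really is an outlier before removing anything by hand, one rough heuristic (continuing from the MWE above; the 99th-percentile cutoff is arbitrary) is to look at each point's distance to its own cluster centre:

import numpy as np

# Distance of every point to the centre of the cluster it was assigned to.
dist_to_own_center = km.transform(X)[np.arange(len(X)), km.labels_]
threshold = np.percentile(dist_to_own_center, 99)  # arbitrary cutoff, tune as needed
outliers = np.where(dist_to_own_center > threshold)[0]
print("candidate outliers:", outliers)

# One could drop these rows and refit, although, as noted above, the singleton
# cluster is already k-means responding correctly to an outlier.
X_trimmed = np.delete(X, outliers, axis=0)
km_trimmed = KMeans(n_clusters=5).fit(X_trimmed)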

How to find the silhouette for k-means clustering in Scala/Spark 2.2.0

I have been working on clustering a dataset in Scala using Spark 2.2.0. Now that I have made the clusters, I want to test/evaluate their quality. I have been able to find the Within Set Sum of Squared Errors for each value of k, but I was hoping to do a silhouette test. Could anyone please share any relevant functions or packages for doing so in Scala?
Silhouette is not scalable: it uses pairwise distances, which will always take O(n^2) time to compute.
Have you considered using the Within Set Sum of Squared Errors, which is already implemented in MLlib (http://spark.apache.org/docs/latest/ml-clustering.html#k-means) and can also help in determining the number of clusters? (See also: Cluster analysis in R: determine the optimal number of clusters.)
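
For reference, here is the elbow-style loop over k using that cost (shown in pyspark for brevity; the Scala API is analogous, and in Spark 2.2 the fitted model exposes computeCost). This is a hedged sketch assuming a DataFrame named data with a Vector column "features":

from pyspark.ml.clustering import KMeans

costs = {}
for k in range(2, 11):
    model = KMeans(k=k, seed=1, featuresCol="features").fit(data)
    costs[k] = model.computeCost(data)  # Within Set Sum of Squared Errors

for k, cost in sorted(costs.items()):
    print(k, cost)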

Clustering data with categorical and numeric features in Apache Spark

I am currently looking for an algorithm in Apache Spark (Scala/Java) that is able to cluster data that has numeric and categorical features.
As far as I have seen, there is an implementation of k-medoids and k-prototypes for pyspark (https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes), but I could not identify anything similar for the Scala/Java version I am currently working with.
Is there another recommended algorithm to achieve similar things for Spark running Scala? Or am I overlooking something and could actually make use of the pyspark library in my Scala project?
If you need further information or clarification feel free to ask.
I think you first need to convert your categorical variables to numbers using OneHotEncoder; then you can apply your clustering algorithm using MLlib (e.g. k-means). I also recommend scaling or normalizing the features before applying the clustering algorithm, as it is distance-sensitive.
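
A hedged pyspark sketch of that recipe (the Scala API mirrors it closely; the column names "category", "num1" and "num2" are made up, and in Spark 2.3-2.4 the multi-column encoder is called OneHotEncoderEstimator):

from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import OneHotEncoder, StandardScaler, StringIndexer, VectorAssembler

indexer = StringIndexer(inputCol="category", outputCol="category_idx")
encoder = OneHotEncoder(inputCol="category_idx", outputCol="category_vec")
assembler = VectorAssembler(inputCols=["category_vec", "num1", "num2"], outputCol="features_raw")
# withMean=False keeps the one-hot vectors sparse while still scaling by standard deviation.
scaler = StandardScaler(inputCol="features_raw", outputCol="features", withMean=False, withStd=True)
kmeans = KMeans(featuresCol="features", k=3, seed=1)

pipeline = Pipeline(stages=[indexer, encoder, assembler, scaler, kmeans])
model = pipeline.fit(df)         # df is the input DataFrame (hypothetical)
clustered = model.transform(df)  # adds a "prediction" column with the cluster id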