Streaming Kmeans Mahout - cluster-analysis

Can anyone explain to me, conceptually, how the streaming k-means algorithm works, and when you would recommend using it? I have not been able to find much about it, and I would like to use the Mahout implementation of it for clustering sensor data and possibly for anomaly detection.
Thanks

Related

How to use Apache spark to implement GraphSAGE?

I want to use Scala and Spark to implement the graph algorithm GraphSAGE. How can I do it? Is there any source code available?
I would like to get code for this.

I haven't implemented this graph algorithm on top of Spark yet. As far as I know, the only available implementation that uses deep learning for graph analysis is this. It is a spectral graph convolution for semi-supervised learning, and it is a transductive algorithm. It can be used for node classification. I plan to include more algorithms, such as GraphSAGE, in the future.

How to load GraphFrame/Pyspark DataFrame into Pytorch Geometric (InMemory)Dataset?

Has anybody ever written a custom pytorch.data.InMemoryDataset for a Spark GraphFrame (or rather for PySpark DataFrames)? I looked for people who have already done it but didn't find anything on GitHub, Stack Overflow, et cetera, and I have little knowledge of PyTorch Geometric as of right now.
Thankful for code samples, tips or matching links :)

You cannot run GCN on Spark as of now, so PyTorch Geometric doesn't support Spark-based training.

Custom loss function for Multiclass classification in Scala and Spark

I want to ask whether it is possible to write a custom loss function for multi-class classification in Spark using Scala. I want to implement multi-class logarithmic loss in Scala. I searched the Spark documentation but could not find any hint.

From the Spark 2.2.0 MLlib guide:
Currently, only binary classification is supported. This will likely change when multiclass classification is supported.

If you are not restricted to a particular classification technique, I would suggest using XGBoost. It has a Spark-compatible implementation, and it lets you use any loss function, provided you can compute its first and second derivatives.
You can find a tutorial here.
Also the explanation about why it is possible to use a custom loss function can be found here.
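As a rough illustration of the quantities involved, here is a minimal sketch in plain Java (no Spark or XGBoost calls) of the multi-class logarithmic loss together with its first and second derivatives. It assumes the class probabilities come from a softmax over raw per-class scores; the class and method names are hypothetical and are not part of any library API.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Spark or XGBoost.
class MultiClassLogLoss {

    // Clip probabilities away from 0 and 1 so that log() stays finite.
    private static final double EPS = 1e-15;

    /** Mean multi-class log loss: -(1/N) * sum_i log(probs[i][labels[i]]). */
    static double logLoss(double[][] probs, int[] labels) {
        double sum = 0.0;
        for (int i = 0; i < labels.length; i++) {
            double p = Math.min(1.0 - EPS, Math.max(EPS, probs[i][labels[i]]));
            sum -= Math.log(p);
        }
        return sum / labels.length;
    }

    /** Numerically stable softmax over one row of raw scores. */
    static double[] softmax(double[] scores) {
        double max = Arrays.stream(scores).max().getAsDouble();
        double[] out = new double[scores.length];
        double total = 0.0;
        for (int k = 0; k < scores.length; k++) {
            out[k] = Math.exp(scores[k] - max);
            total += out[k];
        }
        for (int k = 0; k < out.length; k++) {
            out[k] /= total;
        }
        return out;
    }

    /** First derivative of one row's loss w.r.t. each raw score: p_k - 1{k == label}. */
    static double[] gradient(double[] scores, int label) {
        double[] g = softmax(scores);
        g[label] -= 1.0;
        return g;
    }

    /** Diagonal of the second derivative w.r.t. each raw score: p_k * (1 - p_k). */
    static double[] hessian(double[] scores) {
        double[] p = softmax(scores);
        double[] h = new double[p.length];
        for (int k = 0; k < p.length; k++) {
            h[k] = p[k] * (1.0 - p[k]);
        }
        return h;
    }

    public static void main(String[] args) {
        double[][] probs = {{0.7, 0.2, 0.1}, {0.1, 0.8, 0.1}};
        int[] labels = {0, 1};
        System.out.println("log loss = " + logLoss(probs, labels));
        System.out.println("gradient = " + Arrays.toString(gradient(new double[] {0.0, 0.0, 0.0}, 2)));
    }
}
```

How such gradient and Hessian values are handed to a particular library depends on that library's custom-objective interface, which is not shown here.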

gaussian mixture model (GMM) mllib Apache Spark Scala

I do not think gaussian mixture model is available in mllib yet. I am wondering if any good Scala/Java implementation of GMM (suitable for large data) is available elsewhere. Please let me know.
Thanks and regards,

It is available in Spark MLlib now:
http://spark.apache.org/docs/latest/mllib-clustering.html#gaussian-mixture

Have a look at https://issues.apache.org/jira/browse/SPARK-4156
It is still in progress. We can expect it soon in MLlib.

Clustering classifier and clustering policy

I was going through the K-means algorithm in Mahout and, while debugging, I noticed that when creating the first clusters it runs the following code:
ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);
ClusterClassifier prior = new ClusterClassifier(clusters, policy);
prior.writeToSeqFiles(priorClustersPath);
I read the descriptions of these classes, but they were not clear to me.
What do the cluster classifier and the clustering policy mean?
Are they related to hierarchical clustering, centroid-based clustering, distribution-based clustering, etc.?
I do not see the benefit of, or the reason for, using a cluster classifier and policy in Mahout's K-means implementation.

The implementation shares code with other variants of k-means and similar algorithms such as Canopy pre-clustering and GMM.
These classes encode only the difference between these algorithms.
Mahout is not a good place to study the k-means algorithm: the implementation is quite a mess. It is also really, really slow. Most of the time, a single-CPU implementation will outright beat Mahout on anything that fits into memory, and maybe even on data that fits on the disk of a single machine, because of all the map-reduce overhead.
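For studying the algorithm itself, a minimal single-machine sketch of standard (Lloyd's) k-means in plain Java might look like the following. This is not Mahout's code: the class name, the method names, and the stopping rule (stop once no centroid moves by more than convergenceDelta) are assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical in-memory k-means, unrelated to Mahout's classes.
class SimpleKMeans {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Index of the centroid closest to the given point. */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double dist = squaredDistance(point, centroids[c]);
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** Runs Lloyd's iterations from the given initial centroids and returns the final ones. */
    static double[][] fit(double[][] points, double[][] initial,
                          double convergenceDelta, int maxIterations) {
        int k = initial.length;
        int dim = initial[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initial[c].clone();
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: accumulate each point into its nearest cluster.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = nearest(p, centroids);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            // Update step: move each non-empty cluster's centroid to the mean of its points.
            double maxShift = 0.0;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dim; d++) {
                    sums[c][d] /= counts[c];
                }
                maxShift = Math.max(maxShift, Math.sqrt(squaredDistance(centroids[c], sums[c])));
                centroids[c] = sums[c];
            }
            if (maxShift <= convergenceDelta) {
                break;
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] initial = {{0, 0}, {10, 10}};
        System.out.println(Arrays.deepToString(fit(points, initial, 1e-6, 100)));
    }
}
```

The initial centroids are passed in explicitly so that the sketch stays deterministic; choosing them well is a separate problem that this example does not address.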
