Can anyone explain to me, conceptually, how the streaming k-means algorithm works, and when you would recommend using it? I have not been able to find much about it, and I would like to use the Mahout implementation of it for clustering sensor data and possibly for anomaly detection.
Thanks
Related
I want to use Scala and Spark to implement the graph algorithm GraphSAGE. How can I do it? Is there any source code available?
I want to get the code for my question
I haven't implemented this graph algorithm on top of Spark yet. The only available implementation that I know of for deep learning on graphs is this. It is a spectral graph convolution for semi-supervised learning, and it is a transductive algorithm. It can be used for node classification. I plan to include more algorithms, such as GraphSAGE, in the future.
Has anybody ever written a custom pytorch.data.InMemoryDataset for a Spark GraphFrame (or rather PySpark DataFrames)? I looked for people who have done it already but didn't find anything on GitHub, Stack Overflow, et cetera, and I have little knowledge of PyTorch Geometric as of right now.
Thankful for code samples, tips or matching links :)
You cannot run a GCN on Spark as of now: PyTorch Geometric doesn't support Spark-based training.
Is it possible to write a custom loss function for multiclass classification in Spark using Scala? I want to implement multiclass logarithmic loss in Scala. I searched the Spark documentation but could not find any hint.
From the Spark 2.2.0 MLlib guide:
Currently, only binary classification is supported.. This will likely change when multiclass classification is supported.
If you are not restricted to a particular classification technique, I would suggest using XGBoost. It has a Spark-compatible implementation, and it lets you use any loss function, provided the loss is twice differentiable so that you can compute its first and second derivatives.
You can find a tutorial here.
Also the explanation about why it is possible to use a custom loss function can be found here.
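To make the "derivative twice" requirement concrete, here is a minimal sketch in plain Java of multiclass logarithmic loss together with its first and second derivatives with respect to the raw per-class scores. The class and method names are hypothetical and this is not the XGBoost or Spark API; it only shows the quantities a custom objective would have to supply, assuming the scores are turned into probabilities with a softmax.

```java
import java.util.Arrays;

// Illustrative only: hypothetical names, not an XGBoost or Spark API.
final class MulticlassLogLoss {

    // Numerically stable softmax over the raw scores of one example.
    static double[] softmax(double[] scores) {
        double max = Arrays.stream(scores).max().orElse(0.0);
        double[] p = new double[scores.length];
        double sum = 0.0;
        for (int k = 0; k < scores.length; k++) {
            p[k] = Math.exp(scores[k] - max);
            sum += p[k];
        }
        for (int k = 0; k < p.length; k++) {
            p[k] /= sum;
        }
        return p;
    }

    // Log loss for one example: -log(probability assigned to the true class).
    static double loss(double[] scores, int label) {
        double eps = 1e-15; // guards against log(0)
        return -Math.log(Math.max(softmax(scores)[label], eps));
    }

    // First derivative with respect to each raw score: p_k - y_k.
    static double[] gradient(double[] scores, int label) {
        double[] g = softmax(scores);
        g[label] -= 1.0;
        return g;
    }

    // Diagonal of the second derivative: p_k * (1 - p_k).
    static double[] hessian(double[] scores) {
        double[] p = softmax(scores);
        double[] h = new double[p.length];
        for (int k = 0; k < p.length; k++) {
            h[k] = p[k] * (1.0 - p[k]);
        }
        return h;
    }

    public static void main(String[] args) {
        double[] scores = {2.0, 1.0, 0.1};
        int label = 0;
        System.out.println("loss     = " + loss(scores, label));
        System.out.println("gradient = " + Arrays.toString(gradient(scores, label)));
        System.out.println("hessian  = " + Arrays.toString(hessian(scores)));
    }
}
```

The gradient entries always sum to zero, which is a quick sanity check when porting this to Scala or wiring it into a real objective.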
I do not think a Gaussian mixture model is available in MLlib yet. I am wondering whether a good Scala/Java implementation of GMM (suitable for large data) is available elsewhere. Please let me know.
Thanks and regards,
It is available in Spark MLlib now:
http://spark.apache.org/docs/latest/mllib-clustering.html#gaussian-mixture
Have a look at https://issues.apache.org/jira/browse/SPARK-4156
It is still in progress. We can expect it soon in MLlib.
I was going through the k-means algorithm in Mahout and, while debugging, I noticed that when creating the first clusters it runs the following code:
ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);
ClusterClassifier prior = new ClusterClassifier(clusters, policy);
prior.writeToSeqFiles(priorClustersPath);
I read the descriptions of these classes, but they were not clear to me.
What is the meaning of the cluster classifier and the policy?
Are they related to hierarchical clustering, centroid-based clustering, distribution-based clustering, etc.?
I do not understand the benefit of, or the reason for, using the cluster classifier and policy in Mahout's k-means implementation.
The implementation shares code with other variants of k-means and similar algorithms such as Canopy pre-clustering and GMM.
These classes encode only the difference between these algorithms.
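The idea of shared iteration code plus a small class that encodes the difference can be sketched as a strategy pattern. The following is a simplified illustration in plain Java, not Mahout's actual API: `AssignmentPolicy`, `HardAssignment`, `SoftAssignment` and `PolicyClusterer` are hypothetical names, and the soft variant is only a stand-in for a mixture-style algorithm.

```java
import java.util.Arrays;

// Hypothetical names throughout; this is not Mahout's API.
interface AssignmentPolicy {
    // Turns the distances from one point to every centroid into membership weights.
    double[] weights(double[] distances);
}

// k-means style: all weight goes to the nearest centroid.
final class HardAssignment implements AssignmentPolicy {
    public double[] weights(double[] distances) {
        int best = 0;
        for (int k = 1; k < distances.length; k++) {
            if (distances[k] < distances[best]) best = k;
        }
        double[] w = new double[distances.length];
        w[best] = 1.0;
        return w;
    }
}

// Soft variant: weight decays with squared distance and sums to one.
final class SoftAssignment implements AssignmentPolicy {
    public double[] weights(double[] distances) {
        double min = Arrays.stream(distances).map(d -> d * d).min().orElse(0.0);
        double[] w = new double[distances.length];
        double sum = 0.0;
        for (int k = 0; k < w.length; k++) {
            w[k] = Math.exp(-(distances[k] * distances[k] - min)); // shifted for stability
            sum += w[k];
        }
        for (int k = 0; k < w.length; k++) w[k] /= sum;
        return w;
    }
}

final class PolicyClusterer {
    static double euclidean(double[] a, double[] b) {
        double s = 0.0;
        for (int j = 0; j < a.length; j++) s += (a[j] - b[j]) * (a[j] - b[j]);
        return Math.sqrt(s);
    }

    // One iteration shared by every policy: weight each point, then
    // recompute each centroid as the weighted mean of the points.
    static double[][] iterate(double[][] points, double[][] centroids, AssignmentPolicy policy) {
        int k = centroids.length, d = centroids[0].length;
        double[][] sums = new double[k][d];
        double[] totals = new double[k];
        for (double[] x : points) {
            double[] dist = new double[k];
            for (int c = 0; c < k; c++) dist[c] = euclidean(x, centroids[c]);
            double[] w = policy.weights(dist);
            for (int c = 0; c < k; c++) {
                totals[c] += w[c];
                for (int j = 0; j < d; j++) sums[c][j] += w[c] * x[j];
            }
        }
        double[][] next = new double[k][];
        for (int c = 0; c < k; c++) {
            next[c] = centroids[c].clone(); // keep the old centroid if it attracted no weight
            if (totals[c] > 0.0) {
                for (int j = 0; j < d; j++) next[c][j] = sums[c][j] / totals[c];
            }
        }
        return next;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 10}, {10, 12}};
        double[][] start = {{1, 1}, {9, 9}};
        System.out.println(Arrays.deepToString(iterate(points, start, new HardAssignment())));
        System.out.println(Arrays.deepToString(iterate(points, start, new SoftAssignment())));
    }
}
```

Only the policy object changes between the two calls in `main`; the iteration loop is identical, which is the kind of code sharing the answer describes.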
Mahout is not a good place to study the k-means algorithm; the implementation is quite a mess. It is also slow, as in really, really slow. Most of the time, a single-CPU implementation will outright beat Mahout on anything that fits into memory, and maybe even on anything that fits on the disk of a single machine, because of all the MapReduce overhead.