We have Spark clusters with 100-200 nodes, and we plot several metrics for the executors and the driver.
We are not sure what the best way is to create a dashboard at this scale. Visualizing all 100-200 nodes and their executor stats doesn't surface problems because there is a lot of noise, and it also slows the dashboard down tremendously.
What are some good practices for Grafana dashboards at this scale?
Visualize using only the top K series? (See the sketch after this list.)
Plot only anomalies? If so, how do we detect anomalies?
How do we reduce noise?
How do we make the dashboard more performant?
We use Prometheus as the backend.
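For the top-K idea referenced above, here is a minimal sketch of how such a query could be tested from Python against the Prometheus HTTP API. The server address and the metric name spark_executor_cpu_time_seconds_total are placeholders; substitute whatever your Spark metrics sink actually exports. A Grafana panel can use the same topk() expression directly as its query.

```python
# Sketch only: ask Prometheus for the top 5 executors by CPU usage rate,
# instead of plotting all 100-200 executor series at once.
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # assumed address of the Prometheus server

# Placeholder metric name; use whatever your Spark metrics sink exports.
query = 'topk(5, rate(spark_executor_cpu_time_seconds_total[5m]))'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    labels = series["metric"]           # label set identifying the executor
    timestamp, value = series["value"]  # instant-vector sample
    print(labels, value)
```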
How do you adjust the parameters of a cluster analysis when the subject area is unfamiliar to you and the data contain no "noise" or anomalies, but you know that the potential clusters have a "banana-like" shape?
Thanks in advance, everyone! I'm pretty new to unsupervised learning.
I have a web-usage dataset with inflated zeros. The zeros simply mean a user didn't engage with a web feature. Usage is relatively low for the majority of my variables.
I clustered the data with K-Means, but the silhouette scores are not very convincing: it looks like I get only one good cluster, and it's really big. The plots show the silhouette score for 2 to 5 clusters.
It looks like the 3-cluster solution has the highest score, so I went ahead and clustered with k=3. When I plot the result on a polar plot, the clusters look very clear.
I'm really confused: are my clustering results valid or not?
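For reference, a minimal sketch of the kind of silhouette sweep described above, using scikit-learn. The matrix X here is randomly generated as a stand-in for the (scaled) zero-inflated usage data.

```python
# Sketch of a silhouette sweep over k = 2..5, as described in the question.
# X is a placeholder for the real user-by-feature usage matrix.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.poisson(0.5, size=(1000, 8)).astype(float)  # stand-in for zero-inflated usage data

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))
```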
With sklearn.cluster.AgglomerativeClustering I need to specify the number of resulting clusters in advance. What I would like to do instead is to merge clusters until a certain maximum distance between clusters is reached and then stop the clustering process.
Accordingly, the number of clusters might vary depending on the structure of the data. I care about neither the number of resulting clusters nor their sizes, only that the distance between cluster centroids does not exceed a certain threshold.
How can I achieve this?
This pull request for a distance_threshold parameter in scikit-learn's agglomerative clustering may be of interest:
https://github.com/scikit-learn/scikit-learn/pull/9069
It looks like it'll be merged in version 0.22.
EDIT: See my answer to my own question for an example of implementing single linkage clustering with a distance based stopping criterion using scipy.
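Once that parameter is available (scikit-learn >= 0.22), usage might look like the following sketch. The threshold value is an arbitrary example, and n_clusters must be set to None when distance_threshold is given.

```python
# Sketch: stop merging once clusters are farther apart than a chosen threshold.
# Requires scikit-learn >= 0.22, where distance_threshold was introduced.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.RandomState(0).rand(100, 2)  # placeholder data

model = AgglomerativeClustering(
    n_clusters=None,          # let the threshold decide the number of clusters
    distance_threshold=0.5,   # arbitrary example value
    linkage="ward",
)
labels = model.fit_predict(X)
print("number of clusters found:", model.n_clusters_)
```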
Use scipy directly instead of sklearn. IMHO, it is much better.
Hierarchical clustering is a three-step process:
Compute the dendrogram
Visualize and analyze
Extract branches
But that doesn't fit the supervised-learning-oriented API preference of sklearn, which would like everything to implement a fit/predict API...
SciPy has a function for you:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html#scipy.cluster.hierarchy.fcluster
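A minimal sketch of those three steps with SciPy; the data, the single-linkage choice, and the 0.7 cut-off are arbitrary example values.

```python
# Sketch of the three steps with SciPy: build the dendrogram, (optionally)
# visualize it, then cut it at a chosen distance.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.RandomState(0).rand(50, 2)  # placeholder data

Z = linkage(X, method="single")                    # 1. compute the dendrogram (single linkage)
# dendrogram(Z)                                    # 2. visualize and analyze (needs matplotlib)
labels = fcluster(Z, t=0.7, criterion="distance")  # 3. extract branches below distance 0.7
print("number of clusters:", labels.max())
```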
I have to cluster data consisting of power profiles of solar panel output. I have tried various algorithms, from classical K-means to shape-based clustering. I need to decide how many clusters are present in the pool of data, but I always get 2 clusters, so I think they are very dense.
Is there any way I can partition a dense cluster?
When I cluster my data (with any clustering approach) and compute quality metrics (I have tried several: silhouette, Dunn, etc.), I get very poor scores.
What I'm interested in is whether my data is clusterable at all. Are there any methods to assess that, or a method that tells me whether the data contain any useful information?
Thanks,
Hamid
Maybe it just doesn't have clusters?
Or they do not fit the model evaluated by silhouette, Dunn, etc. These metrics can be quite misleading, in particular when your data set also contains noise. Don't blindly trust such metrics.
The best way to see whether your data can be clustered is visualization. If you can't visualize it in a way that shows clusters, how can you expect an algorithm to return meaningful clusters?
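As an illustration of that advice, here is a minimal sketch that projects the data to 2-D with PCA for a visual check. X is a placeholder for your feature matrix, and t-SNE or UMAP could be swapped in for the projection.

```python
# Sketch: project the data to 2-D and eyeball it for cluster structure.
# X is a placeholder feature matrix; real data goes here.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.RandomState(0).rand(500, 10)

X2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
plt.scatter(X2[:, 0], X2[:, 1], s=5)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```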