Cluster Size is too big after BIRCH clustering - cluster-analysis

I have a dataset of 2.4 million rows and about 56 variables. I sampled 10,000 rows and ran PCA to reduce the data to 10 dimensions.
Then I used BIRCH clustering, since k-means and hierarchical clustering were giving bad silhouette coefficients. The scikit-learn documentation says that the use case for BIRCH is large datasets and data reduction.
As a result, I get 4 clusters with a silhouette coefficient of 0.4 (-1 is the worst, 1 is the best), which I think is good enough. The problem is that the first cluster is too big: it gets 94% of all the data, while the other clusters share the remaining 6%.
So my questions are: do PCA and sampling affect the BIRCH clustering result? And what can be done about the cluster that dominates in size?
I am thinking of either re-clustering the 94% or just accepting that 94% of my data really does belong to the same cluster.
Thanks
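
For reference, a minimal sketch of the pipeline described in the question, assuming scikit-learn; the random stand-in data and all parameter values are placeholders:

```python
# Sketch: sample -> PCA to 10 dims -> BIRCH -> silhouette + cluster sizes.
# The stand-in array below replaces the real 2.4M x 56 dataset.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import Birch
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X_full = rng.normal(size=(100_000, 56))  # smaller stand-in for the real data

# Sample 10,000 rows without replacement
idx = rng.choice(X_full.shape[0], size=10_000, replace=False)
X_sample = X_full[idx]

# Reduce to 10 dimensions with PCA
X_reduced = PCA(n_components=10).fit_transform(X_sample)

# BIRCH with 4 final clusters (passing n_clusters runs a global
# clustering step over the subclusters BIRCH builds)
labels = Birch(n_clusters=4).fit_predict(X_reduced)

print("silhouette:", silhouette_score(X_reduced, labels))
print("cluster sizes:", np.bincount(labels))  # reveals the 94% problem
```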

Related

Taking a big chunk of time while running k-means on Python Spark

I have a NumPy array of 0s and 1s with 37k rows and 6k columns.
When I try to run k-means clustering in PySpark, it takes almost forever and I cannot get the output. Is there any way to reduce the processing time, or any other tricks to solve this issue?
I think you may have too many columns; you could be facing the curse of dimensionality. Wikipedia link
[...] The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. [...]
To solve this problem, have you considered reducing your columns and using only the relevant ones? Check this Wikipedia link again
[...] Feature projection transforms the data in the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist. [...]
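
For example, a hypothetical sketch that projects the columns down with PCA before clustering, using PySpark's ML pipeline; the tiny DataFrame and the values of k are stand-ins for the real 37k x 6k matrix:

```python
# Sketch: reduce dimensionality with PCA, then run k-means in the
# projected space. All column names and k values here are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("pca-then-kmeans").getOrCreate()

# Tiny stand-in for the real 37k x 6k matrix of 0s and 1s
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 0.0, 1.0, 0.0]),),
     (Vectors.dense([0.0, 1.0, 0.0, 1.0]),),
     (Vectors.dense([1.0, 1.0, 0.0, 0.0]),)],
    ["features"],
)

# Project the features down first (here 4 -> 2; for 6k columns you
# would pick k from the explained variance)
pca = PCA(k=2, inputCol="features", outputCol="pca_features")
df_reduced = pca.fit(df).transform(df)

# Cluster in the reduced space, which is far cheaper per iteration
kmeans = KMeans(k=2, featuresCol="pca_features", seed=42)
model = kmeans.fit(df_reduced)
model.transform(df_reduced).select("prediction").show()

spark.stop()
```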

Clustering a large dataset, learning a large number of vocabulary words

I am trying to cluster a large dataset with dimensions:
rows: 1.4 million
cols: 900
expected number of clusters: 10,000 (10k)
The problem is that my dataset is 10 GB in size and I have 16 GB of RAM. I am trying to implement this in Matlab. It would be a big help if someone could respond.
P.S. So far I have tried hierarchical clustering. One paper suggested going for "fixed radius incremental pre-clustering", but I didn't understand the procedure.
Thanks in advance.
Use an algorithm that does not require a distance matrix. Instead, choose one that can be index accelerated.
Anything that needs a distance matrix will exceed your memory. Even an algorithm that does not (e.g., SLINK, which uses only O(n) memory) may still take too long. Indexes can reduce the runtime to O(n log n), although on your data the indexes may have problems, since 900 dimensions is a lot.
Index-accelerated algorithms include, for example, OPTICS and DBSCAN.
Just don't use the really bad Matlab scripts for these algorithms.
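
If leaving Matlab is an option, a minimal sketch with scikit-learn's DBSCAN, which can answer the neighborhood queries through a ball-tree index instead of a full pairwise distance matrix; `eps` and `min_samples` are placeholders that would need tuning, and with 900 columns the index itself may degrade, as noted above:

```python
# Sketch of an index-accelerated clustering run. The stand-in data
# replaces the real 1.4M x 900 dataset; eps/min_samples need tuning.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))  # small stand-in for the real data

# algorithm="ball_tree" forces a tree index for neighbor lookups,
# avoiding the O(n^2) distance matrix that would blow past 16 GB
db = DBSCAN(eps=3.0, min_samples=10, algorithm="ball_tree").fit(X)

# Label -1 marks noise points, so it is excluded from the count
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("clusters found:", n_clusters)
```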

How to predict training time (and/or required RAM) for neural network training?

This may be too general a question, but in essence I'm trying to predict how long it will take to train a feed-forward neural network for regression, given the amount of training data and the number/organization of nodes. I need to do this because I will be using a university cluster to perform the training, and when I submit the job I need to specify the expected time as well as the requested RAM and number of CPU cores.
I am working in Matlab.
Right now my network takes a 400-element vector as input and outputs a 36-element vector (a regression fit to the training data).
Given 5,000 training data points, how can I predict what resources I will need and how long it will take to run on them? (I can have up to 1 TB of RAM, and there may be ways to use GPUs, but I haven't figured out how to do so yet.)
Any help or pointers to other resources would be very appreciated.
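
One rough way to size such a job is to estimate memory from the network's parameter count and extrapolate the wall-clock time from a short timed run on a subset of the data. A back-of-the-envelope sketch in Python; the single hidden layer of 100 units is an assumption, since the question does not state the hidden-layer sizes:

```python
# Back-of-the-envelope resource estimate, not a precise predictor.
# The 400-in / 36-out shapes come from the question; the hidden
# layer size is a placeholder.
def param_count(n_in=400, hidden=(100,), n_out=36):
    """Weights + biases in a fully connected feed-forward network."""
    sizes = [n_in, *hidden, n_out]
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))

p = param_count(hidden=(100,))
# 8 bytes per double; training state (gradients, and for second-order
# trainers like Levenberg-Marquardt also Jacobian storage that grows
# with the number of samples) multiplies this several times over
print(f"{p} parameters, ~{p * 8 / 1e6:.2f} MB per copy of the weights")

# For wall-clock time: time one epoch (or a run on, say, 500 of the
# 5,000 points) locally, then scale roughly linearly with the number
# of samples and epochs when requesting time on the cluster.
```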

Comparing k-means clustering

I have 150 images, 15 each of 10 different people, so I basically know which images should belong together if clustered.
These images have 73-dimensional feature vectors, and I clustered them into 10 clusters using the kmeans function in Matlab.
Later, I processed these 150 data points, reduced their dimension from 73 to 3 for my work, and applied the same kmeans function to them.
I want to compare the results obtained on these two data sets (processed and unprocessed) with the same k-means function, to learn whether the processing that reduced the dimension improves the k-means clustering or not.
I thought comparing the variance of each cluster could be one parameter for comparison, but I am not sure I can directly compare and evaluate my results (within-cluster sum of distances, etc.), since the two cases have different dimensions. Could anyone suggest a way to compare the k-means results, some way to normalize them, or any other comparison I can make?
I can think of three options; I am unaware of any well-developed methodology for doing this specifically with k-means clustering.
1. Look at the confusion matrices of the two approaches.
2. Compare the Mahalanobis distances between the clusters, and from items in clusters to their nearest other clusters.
3. Look at the Voronoi cells and see how far your points are from the boundaries of the cells.
The problem with option 3 is that the distance metrics get skewed: 3-D and 73-D distances are not commensurate, so I'm not a fan of that approach. I'd recommend reading some books on k-means if you are set on that path; rank speculation is fun, but standing on the shoulders of giants is better.
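
A minimal sketch of option 1, assuming the true person labels are available (15 images each of 10 people) so each clustering can be scored against ground truth regardless of dimensionality; the feature arrays here are random stand-ins:

```python
# Compare the two clusterings via confusion matrices against the known
# labels. Cluster ids are arbitrary, so they are first aligned to the
# true labels with the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix, adjusted_rand_score

rng = np.random.default_rng(0)
y_true = np.repeat(np.arange(10), 15)   # 10 people, 15 images each
X_73d = rng.normal(size=(150, 73))      # stand-in for the raw features
X_3d = rng.normal(size=(150, 3))        # stand-in for the reduced features

for name, X in [("73-D", X_73d), ("3-D", X_3d)]:
    labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
    cm = confusion_matrix(y_true, labels)
    # Match each cluster id to the true label it best explains, so the
    # diagonal of the confusion matrix is maximized before comparing
    row, col = linear_sum_assignment(-cm)
    accuracy = cm[row, col].sum() / cm.sum()
    # The adjusted Rand index is label- and dimension-independent,
    # which sidesteps the 3-D vs. 73-D distance problem entirely
    print(name, "matched accuracy:", accuracy,
          "ARI:", adjusted_rand_score(y_true, labels))
```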

What is the effect of changing the maximum value of iterations in k-means clustering?

In Matlab, I'm creating a visual codebook using Bag of Features with the SURF features of 3913 images and k = 450. I train an SVM classifier with the visual codebook, and then use it to classify video frames to detect humans. The video I'm using is an aerial one. My maximum number of iterations is 100 by default, but when I ran the code I got a warning from Matlab that says "Failed convergence at 100 iterations". What does this mean? Does it affect my clustering? I only have 2 classes: person and non-person. Does it mean I have to increase my maximum iterations for better results, or should I decrease it?
When you say 100 iterations, are you talking about the clustering, i.e., building the "visual vocabulary"? If so, the message you are getting indicates that the k-means clustering was not able to converge within 100 iterations. That means the cluster centers are still moving after each iteration by an amount greater than what the convergence criterion specifies. The most reasonable thing to do is to run k-means for more iterations.
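
To illustrate the point, a small sketch in Python with scikit-learn (the question uses Matlab, but the mechanics are the same): `max_iter` plays the role of Matlab's iteration limit, and the fitted model's `n_iter_` reveals whether the run converged before hitting it. The data and parameter values are stand-ins:

```python
# Show the effect of the iteration cap on k-means convergence.
# The random matrix stands in for the pooled SURF descriptors.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 64))  # stand-in descriptor matrix

for max_iter in (100, 1000):
    km = KMeans(n_clusters=450, max_iter=max_iter, n_init=1,
                random_state=0).fit(X)
    # n_iter_ < max_iter: the centers stopped moving (converged);
    # n_iter_ == max_iter: the run hit the cap, like the Matlab warning
    print(f"max_iter={max_iter}: ran {km.n_iter_} iterations, "
          f"inertia={km.inertia_:.1f}")
```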