Determining Data Homogeneity/Heterogeneity Using Clustering - cluster-analysis

I have sequential data (i.e., it arrives one instance at a time). After a while, once some number of instances has accumulated, I want to determine whether they are stochastic (i.e., sparse) or homogeneous (i.e., there is some correlation).
To do this I am using sequential K-means. First, two cluster centers are given, and the data is sequentially clustered into two classes. After a while, if I observe that the data is spread between the two clusters, I say that it is stochastic. However, if I observe that the data is mostly accumulated in one cluster (e.g., 70% of the data), I say that the data is homogeneous.
Is my thinking correct?
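For concreteness, here is a minimal sketch of the procedure I describe above (my own illustration, assuming Euclidean distance and the standard online k-means update):

import numpy as np

def sequential_two_means(stream, c0, c1):
    centers = np.array([c0, c1], dtype=float)
    counts = np.zeros(2, dtype=int)
    for x in stream:
        j = np.argmin(((centers - x) ** 2).sum(axis=1))   # nearest center
        counts[j] += 1
        centers[j] += (x - centers[j]) / counts[j]        # online mean update
    return centers, counts

def is_homogeneous(counts, threshold=0.70):
    # "homogeneous" if one cluster holds at least 70% of the instances,
    # otherwise the data is treated as stochastic (spread between the clusters)
    return counts.max() / counts.sum() >= threshold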

Related

Using Multiple GPUs outside of training in PyTorch

I'm calculating the accumulated distance between each pair of kernels inside an nn.Conv2d layer. However, for large layers it runs out of memory on a Titan X with 12 GB of memory. I'd like to know if it is possible to divide such calculations across two GPUs.
The code follows:
def ac_distance(layer):
    total = 0
    for p in layer.weight:
        for q in layer.weight:
            total += distance(p, q)
    return total
Where layer is an instance of nn.Conv2d and distance returns the sum of the differences between p and q. I can't detach the graph, however, because I need it later on. I tried wrapping my model in nn.DataParallel, but all the calculations in ac_distance are done using only one GPU, although training uses both.
Parallelism while training neural networks can be achieved in two ways.
Data Parallelism - Split a large batch into two and do the same set of operations but individually on two different GPUs respectively
Model Parallelism - Split the computations and run them on different GPUs
As you have asked in the question, you would like to split the calculation, which falls into the second category. There are no out-of-the-box ways to achieve model parallelism. PyTorch provides primitives for parallel processing via the torch.distributed package. This tutorial goes through the details of the package comprehensively, and you can cook up an approach to achieve the model parallelism you need.
However, model parallelism can be very complex to achieve. The general way is to do data parallelism with either torch.nn.DataParallel or torch.nn.DistributedDataParallel. In both methods, you run the same model on two different GPUs, but one huge batch is split into two smaller chunks. The gradients are accumulated on a single GPU and optimization happens there. Optimization takes place on a single GPU in DataParallel, and in parallel across GPUs in DistributedDataParallel by using multiprocessing.
In your case, if you use DataParallel, the computation would still take place on two different GPUs. If you notice an imbalance in GPU usage, it could be because of the way DataParallel has been designed. You can try using DistributedDataParallel, which is the fastest way to train on multiple GPUs according to the docs.
There are other ways to process very large batches too. This article goes through them in detail, and I'm sure it will be helpful. A few important points:
Do gradient accumulation for larger batches
Use DataParallel
If that doesn't suffice, go with DistributedDataParallel
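As for the pairwise computation in the question itself, there is no built-in way to spread it over two GPUs, but you can split the outer loop by hand. Below is a rough sketch under my own assumptions (two visible GPUs, and distance(p, q) taken as the sum of absolute differences); autograd tracks the .to() copies, so the graph stays intact and gradients still flow back to layer.weight:

import torch

def ac_distance_two_gpus(layer, dev0="cuda:0", dev1="cuda:1"):
    w = layer.weight                   # shape (out_channels, in_channels, kH, kW)
    n = w.shape[0]
    flat = w.view(n, -1)               # one row per kernel

    half = n // 2
    total = torch.zeros((), device=dev0)
    for dev, rows in ((dev0, range(0, half)), (dev1, range(half, n))):
        f = flat.to(dev)               # differentiable copy onto this GPU
        for i in rows:
            # distance of kernel i to every kernel, accumulated on dev0
            total = total + (f[i].unsqueeze(0) - f).abs().sum().to(dev0)
    return total

# Example usage (hypothetical layer sizes):
# layer = torch.nn.Conv2d(256, 512, kernel_size=3).to("cuda:0")
# loss = ac_distance_two_gpus(layer)
# loss.backward()

Whether this actually resolves the out-of-memory error depends on your layer sizes, since the autograd graph for all the pairwise terms is still kept; it only moves half of the arithmetic onto the second GPU.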

Generating random clusters' centers with max distance

I have sequential data (one instance at a time) to be clustered into two classes. I want to use the sequential version of K-means (sequential K-means) for this task.
When randomly specifying the centers of the two clusters for the algorithm at the beginning, I want the distance between them to be as large as possible (i.e., very far away from each other), so that the distribution of the resulting two clusters will not be affected by the initial centers.
Is my thinking correct? If so, how can I do that?
Rather, try to best estimate the true means. That is the optimal strategy.
If you just want to make them far apart, that can lead to badly assigned points in between.
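A sketch of this suggestion (my own illustration): estimate the two means from a small warm-up buffer of the stream instead of pushing the initial centers apart, then continue with the usual sequential (online) k-means updates.

import numpy as np

def init_centers_from_buffer(buffer, k=2, iters=20, seed=0):
    """Plain batch k-means on the buffered points; returns estimated means."""
    rng = np.random.default_rng(seed)
    X = np.asarray(buffer, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each buffered point to its nearest current center
        assign = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # recompute each center as the mean of its assigned points
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers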

'ufactor' parameter in metis for imbalance clustering

I have been using METIS for clustering social media users.
By default, it outputs clusters with the same number of vertices on each side, which is not ideal in a real-world scenario. So I was trying to find a way to loosen the "same number of vertices" constraint and get a possibly imbalanced partition with a minimized cut value.
I found a parameter, ufactor, in the manual which I think is suitable for my case, but I did not grasp what it really does. I have a large graph and tried some values of ufactor. For one dataset ufactor=1000 works very well, but for another dataset it could not even partition the graph. I cannot interpret this result, as I did not understand what the parameter is really doing. Here is what I found in the manual about it:
Specifies the maximum allowed load imbalance among the partitions. A value of x indicates that the allowed load imbalance is (1 + x)/1000. The load imbalance for the jth constraint is defined to be max_i(w[j, i]/t[j, i]), where w[j, i] is the fraction of the overall weight of the jth constraint that is assigned to the ith partition and t[j, i] is the desired target weight of the jth constraint for the ith partition (i.e., that specified via -tpwgts). For -ptype=rb, the default value is 1 (i.e., load imbalance of 1.001) and for -ptype=kway, the default value is 30 (i.e., load imbalance of 1.03).
Can anybody help me interpret this? What is the jth constraint? What is -ptype=rb/kway?
First of all, let me mention that I think METIS is the wrong tool here, because it is used for graph partitioning, where the emphasis is on minimizing the number of edges between partitions while keeping the partitions balanced (more or less equal sizes).
What you probably want to do is community detection within social networks, i.e. the search for clusters which maximize internal connectivity (large number of edges between nodes from the same cluster) and minimize external connectivity (small number of edges between different clusters).
This can be achieved by maximizing the so-called modularity of the clustering.
There are several approaches to tackle this problem, a popular heuristic being Label propagation.
If you don't want to implement the algorithm yourself, I would recommend using a framework like NetworKit (unfortunately, I don't know any other such frameworks yet), which implements Label propagation, some modularity-based algorithms and many helpful tools.
But back to your original question:
What is -ptype=rb/kway?
There are multiple ways to approach the graph partitioning problem: you can either try to partition the graph into your desired number of partitions directly (k-way partitioning), or you can split the graph in half repeatedly until you have the desired number of partitions (recursive bisection, rb).
What is the jth constraint?
METIS allows you to try and optimize multiple balance constraints at the same time, i.e. if you have multiple types of calculations on the graph that should all be more or less balanced among the compute nodes.
See the manual:
Many important types of multi-phase and multi-physics computations require that multiple quantities be load balanced simultaneously. [...] METIS includes partitioning routines that can be used to partition a graph in the presence of such multiple balancing constraints. Each vertex is now assigned a vector of m weights and the objective of the partitioning routines is to minimize the edge-cut subject to the constraints that each one of the m weights is equally distributed among the domains.
EDIT: Since you clarified that you wanted to look at a fixed number of clusters, I see how graph partitioning could be helpful here. Let me illustrate what ufactor means:
The imbalance of a partitioned graph is (in this simple case) computed as the maximum of the imbalance over the partitions, which is roughly the quotient partition size / average partition size. So if we allow a maximum imbalance of 2, this means that the largest partition is twice as big as the average partition. Note however that ufactor doesn't specify the imbalance directly; it specifies how many permille away from 1 the imbalance is allowed to be.
So ufactor=143 actually means that your maximal allowed imbalance is 1.143, which makes sense since your clusters are not that far from each other. So in your case, you will probably use larger values for ufactor to allow the groups to be of quite different sizes.
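To make the arithmetic explicit, here is a tiny helper (my own illustration) for converting between ufactor and the allowed imbalance, following the rule quoted from the manual above:

def ufactor_to_imbalance(ufactor):
    # allowed load imbalance = 1 + ufactor / 1000
    return 1.0 + ufactor / 1000.0

def imbalance_to_ufactor(imbalance):
    return round((imbalance - 1.0) * 1000)

# ufactor=143  -> allowed imbalance 1.143 (largest part at most 14.3% above average)
# ufactor=1000 -> allowed imbalance 2.0   (largest part may be twice the average size)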
Consequences of large imbalance
If your imbalance is too large, it might happen that all the strongly-connected parts land in the same partition while only isolated nodes are put in the other partitions. This is due to the fact that the algorithm tries to minimize the number of cut edges between different partitions, which will be lower if we put all the high-degree nodes in the same partition.
What about spectral partitioning, ...?
The general approach of METIS works as follows:
Most input graphs are too large to partition directly, which is why so-called multilevel methods are used:
The graph is first coarsened (nodes are combined while trying to preserve the graph structure) until its size becomes feasible to partition directly.
The coarsest graph is partitioned using an initial partitioning technique, where we could use a variety of approaches (combinatorial bisection, spectral bisection, exact solutions using ILPs, ...).
The graph is then uncoarsened, where in each step a small number of nodes are moved from partition to partition in a local search to improve the overall edge cut.
My personal recommendation
I should however note that while graph partitioning might be a valid model for your case, METIS itself might not be the ideal implementation for you:
As you can read on the METIS homepage, it is mostly used for rather sparse graphs ('finite element methods, linear programming, VLSI, and transportation'), whereas social networks are much denser and have a different structure (degrees follow a power-law distribution).
The coarsening approach of METIS uses heavy edge matching to combine nodes which are somehow close together, which works great for the intended applications, for social networks however, clustering-based coarsening techniques might prove more efficient.
Another library that is a bit slower in general, but implements some presets especially for social networks is KaHIP, see the manual for details.
(I should mention however that I am biased in this regard, since I worked extensively with this library ;-) )

Cluster assignment remapping

I have labelled test classification datasets from the UCI Machine Learning Repository.
I am stripping off the labels and using the data to benchmark a few clustering algorithms, and then I plan to use external validation methods. I will run each algorithm with different initial configurations, say 50 times, and then take the mean value. Across the 50 runs the algorithm labels the data points of a single cluster with different numbers. Because the cluster labels can change in each run, and because each run might produce slightly different cluster assignments, how can I remap each of the clusters to one uniform numbering?
My primary idea is to remap each cluster by checking which true class its points intersect with the most and mapping it to that class, but this can produce incorrect remappings when the classes have roughly equal numbers of points.
Another idea is to keep the labels while clustering but make the clustering algorithm ignore them. That way all the clustered data would carry its label tags. This is doable, but I already have benchmarked cluster assignment data to process, so I am trying to avoid modifying my implementation of the cluster analysis algorithms (to include the label tag in the vectors and then ignore it) and re-benchmarking, which would take quite some time and CPU.
Is there any way that I can compute average accuracy from the cluster assignments I have right now?
EDIT:
In the domain I am studying (metaheuristic clustering algorithms), I could not find a paper comparing these indices. The one paper that does compare them seems to have incorrect values. Can anyone point me to a paper where clustering results are compared using any of these indices?
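For reference, the overlap-based remapping I describe could be made consistent by solving an optimal assignment on the confusion matrix between two labelings. A sketch under the assumption that both labelings use the same number of clusters, numbered 0..k-1:

import numpy as np
from scipy.optimize import linear_sum_assignment

def remap_labels(reference, labels):
    k = int(max(reference.max(), labels.max())) + 1
    confusion = np.zeros((k, k), dtype=int)
    for r, l in zip(reference, labels):
        confusion[l, r] += 1                      # overlap of cluster l with class r
    row, col = linear_sum_assignment(-confusion)  # permutation maximizing total overlap
    mapping = dict(zip(row, col))
    return np.array([mapping[l] for l in labels])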
What do you do when the number of clusters doesn't agree?
Do not try to map clusters.
Instead, use the proper external validation measures for clustering, which do not require a 1:1 correspondence of clusters. There are plenty, for details see Wikipedia.
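For example, the adjusted Rand index and normalized mutual information compare two labelings irrespective of how the clusters are numbered, so no remapping is needed. A small sketch using scikit-learn (with my own illustrative labels):

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2]
run_a       = [1, 1, 0, 0, 2, 2]   # same grouping as true_labels, different numbering
run_b       = [0, 1, 0, 1, 2, 2]   # a genuinely different grouping

print(adjusted_rand_score(true_labels, run_a))           # 1.0 (perfect agreement)
print(adjusted_rand_score(true_labels, run_b))           # < 1.0
print(normalized_mutual_info_score(true_labels, run_b))  # also permutation-invariant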

Identify clusters in SOM (Self Organizing Map)

Once I have collected and organized data in a SOM how do I identify clusters?
(Items are aggregated and clustered using many traits - upwards of 10)
Specifically, I want to find the 'center' of each cluster, thereby giving me the 'center' node(s).
You could use a relatively small map and consider each node a cluster, but this is far from optimal. If you want to apply an automated cluster detection method, you should definitely read
Clustering of the Self-Organizing Map
and search similar bibliography.
You could also use more sophisticated versions of the SOM algorithm (multi-level, self-growing, etc.).
In any case, keep in mind that the problem of finding the "correct" number of clusters doesn't have a definitive solution.
As far as I can tell, SOM is primarily a data-driven dimensionality reduction and data compression method. So it won't cluster the data for you; it may actually tend to spread clusters in the projection (i.e. split them into multiple cells).
However, it may work well for some data sets to either:
Instead of processing the full data set, work only on the SOM nodes (weighted by the number of elements assigned to them), which should be significantly smaller
Instead of working in the original space, work in the lower-dimensional space that the SOM represents
And then run a regular clustering algorithm on the transformed data.
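A rough sketch of that second route, assuming the MiniSom package for the SOM and scikit-learn's k-means on the codebook vectors (the map size and cluster count below are placeholders):

import numpy as np
from minisom import MiniSom
from sklearn.cluster import KMeans

data = np.random.rand(1000, 10)        # stand-in for your items with ~10 traits

som = MiniSom(15, 15, data.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(data, 5000)

codebook = som.get_weights().reshape(-1, data.shape[1])   # (15*15, 10) node vectors
node_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(codebook)

def item_cluster(x):
    i, j = som.winner(x)               # best-matching SOM node for this item
    return node_labels[i * 15 + j]

item_labels = np.array([item_cluster(x) for x in data])

The 'center' nodes asked about in the question would then be the SOM nodes whose codebook vectors lie closest to each k-means centroid.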
Though this is an old question, I've encountered the same issue and had some success implementing Estimating the Number of Clusters in Multivariate Data by Self-Organizing Maps, so I thought I'd share.
The linked algorithm uses the U-matrix to highlight the boundaries of the individual clusters and then uses an image processing algorithm called watershedding to identify the components. For this to work correctly, the regions in the U-matrix are required to be concave within the resolution of your quantization (which, when converted to a binary image, simply results in using a flood fill to identify the regions).
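A rough sketch of that idea (not the paper's exact procedure), assuming MiniSom for the U-matrix, SciPy for labelling the flood-filled regions, and scikit-image's watershed:

import numpy as np
from minisom import MiniSom
from scipy import ndimage
from skimage.segmentation import watershed

data = np.random.rand(1000, 10)
som = MiniSom(30, 30, data.shape[1], sigma=1.5, learning_rate=0.5, random_seed=0)
som.train_random(data, 10000)

u_matrix = som.distance_map()            # high values mark cluster boundaries
# Seed the watershed from the low-valued regions (cluster interiors),
# found here with a simple threshold plus connected-component labelling.
markers, _ = ndimage.label(u_matrix < np.percentile(u_matrix, 30))
regions = watershed(u_matrix, markers)   # one integer cluster label per SOM node

The threshold (30th percentile here) is an arbitrary placeholder; the linked paper derives the regions from the U-matrix via the concavity/flood-fill condition described above.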