buffer of clusters in a sparse matrix - matlab

I work with MATLAB.
I have a sparse matrix in which I have identified different clusters. All pixels within a cluster share the same value, and each cluster has its own unique value; the background (outside the clusters) is 0. Here is an example with clusters 1 and 2:
A = [0 0 0 0 0 2 0 0 2 0 0 0
     1 1 0 0 0 2 2 2 2 0 0 0
     1 1 1 0 0 0 2 2 2 2 0 0
     1 1 0 0 0 0 0 2 2 0 0 0
     1 1 1 0 0 0 0 0 0 0 0 0]
I'd like to use each cluster as "a polygon" and study the values of the neighboring pixels just outside it (a sort of buffer, as with vector data). Obviously in this example the mean would always be 0, but the point is understanding how to do it, because I have to apply this to another matrix (I work with geolocated data, so I would use the buffer area to find mean values in specific rasters). Is there a way to do that? And if so, can I specify the width of this buffer (as a number of pixels)?
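One possible approach (my sketch, not from the thread) is morphological dilation: dilate a binary mask of one cluster and subtract the cluster itself, which leaves a ring of the chosen width around it. A minimal sketch, assuming the Image Processing Toolbox is available and that R is a hypothetical raster aligned with A:
w = 2;                              % buffer width in pixels
k = 2;                              % label of the cluster of interest
mask = (A == k);                    % binary mask of that cluster
se = strel('square', 2*w + 1);      % structuring element giving a w-pixel buffer
ring = imdilate(mask, se) & ~mask;  % pixels within w of the cluster, outside it
ring = ring & (A == 0);             % optionally keep only background pixels
meanVal = mean(R(ring));            % R = raster whose values you want to average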

Related

Find "complemented" bit vectors clusters

I have a huge list of bit vectors (BVs) that I want to group into clusters.
The idea behind these clusters is to be able to later pick BVs from each cluster and combine them to generate a BV with (almost) all ones (the number of ones should be maximized).
For example, imagine that 1 means an app is up and 0 means it is down on node X at a specific moment in time. We want to find the minimal list of nodes needed to have the app up:
App BV for node X in cluster 1: 1 0 0 1 0 0
App BV for node Y in cluster 2: 0 1 1 0 1 0
Combined BV for App (X+Y): 1 1 1 1 1 0
I have been checking different clustering algorithms, but I did not find one that takes this "complementary" behavior into account, because in this case each column of the BV does not refer to a feature (it only means up or down in a specific timeframe).
Regarding other algorithms like k-means or hierarchical clustering, it is not clear to me whether I can build this consideration for the later grouping into the clustering algorithm.
Finally, I am using the Hamming distance to determine the intra-cluster and inter-cluster distances, since it seems to be the most appropriate metric for binary data. However, the results show that the clusters are neither closely grouped nor well separated from each other, so I wonder whether I am applying the most suitable grouping/approximation method, or even whether I should filter the input data before grouping.
Any clue or idea regarding the grouping/clustering method or data filtering is welcome.
This does not at all sound like a clustering problem.
None of these algorithms will help you.
Instead, I would rather call this a matchmaking algorithm. But I'd assume it is at least NP-hard to find the true optimum (it resembles set cover), so you'll need to come up with a fast approximation, ideally something specific to your use case.
Also, you haven't specified how to combine two 1s (you wrote +, but that is likely not what you want): is it xor or or? Nor whether it is possible to combine more than two vectors, and what the cost of doing so is. One strategy would be to find, for each bit vector, the nearest neighbor of its inverse, and always combine the best pair.
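Since the problem resembles set cover, one standard fast approximation (my sketch, not the strategy proposed above) is the greedy one: repeatedly pick the BV that covers the most still-uncovered positions. A minimal MATLAB sketch with made-up example vectors:
% Greedy set-cover-style selection (illustrative data only)
BV = logical([1 0 0 1 0 0;    % node X
              0 1 1 0 1 0;    % node Y
              0 0 0 0 0 1]);  % node Z (invented so the cover can be completed)
covered = false(1, size(BV, 2));
chosen  = [];
while ~all(covered)
    gain = sum(BV & ~covered, 2);   % new positions each candidate would cover
    [best, idx] = max(gain);
    if best == 0, break; end        % nothing left improves coverage
    chosen  = [chosen, idx];        %#ok<AGROW>
    covered = covered | BV(idx, :);
end
chosen                              % indices of the selected bit vectors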

Interpretation of 'ufactor' on a toy graph clustering

I am trying to do an imbalanced partition with METIS. I do not need an equal number of vertices in each cluster (which is what METIS does by default). My graph has no constraints; it is an undirected, unweighted graph. Here is an example toy graph clustered by METIS without any ufactor parameter.
Then I tried different ufactor values, and at 143 METIS starts to produce the expected clustering, like the following.
Can anybody interpret this? Eventually, I want to find a way to guess a ufactor, for any unbalanced undirected graph, that will minimize the normalized cut without necessarily enforcing balance.
imbalance = 1 + (ufactor/1000). By default, imbalance = 1. The number of vertices allowed in the largest cluster is
imbalance * (number of vertices / number of clusters)
For the first picture (default clustering), the number of vertices in the largest cluster is
1 * (14/2) = 7, so the second cluster also has 14 - 7 = 7.
In the second picture (ufactor 143):
imbalance = 1 + 143/1000 = 1.143
so 1.143 * (14/2) = 8.001
That allows the largest cluster to have 8 vertices.
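If it helps, the same formula can be inverted to guess a ufactor for a desired largest-cluster size (my sketch, reusing the numbers from the example above):
n       = 14;       % number of vertices
k       = 2;        % number of clusters
maxSize = 8;        % desired upper bound on the largest cluster
imbalance = maxSize / (n / k);              % 8 / 7 ~ 1.143
ufactor   = ceil(1000 * (imbalance - 1))    % = 143, matching the example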

MATLAB: Using CONVN for moving average on Matrix

I'm looking for a bit of guidance on using CONVN to calculate moving averages in one dimension on a 3d matrix. I'm getting a little caught up on the flipping of the kernel under the hood and am hoping someone might be able to clarify the behaviour for me.
A similar post that still has me a bit confused is here:
CONVN example about flipping
The Problem:
I have daily river and weather flow data for a watershed at different source locations.
So the matrix is laid out as follows:
dim 1 (the rows) represent each site
dim 2 (the columns) represent the date
dim 3 (the pages) represent the different type of measurement (river height, flow, rainfall, etc.)
The goal is to use CONVN to take a 21-day moving average at each site, for each observation point, for each variable.
As I understand it, I should just be able to use a kernel such as:
ker = ones(1,21) ./ 21.;
mat = randn(150,365*10,4);
avgmat = convn(mat,ker,'valid');
I tried playing around and created another kernel which should also work (I think) and set ker2 as:
ker2 = [zeros(1,21); ker; zeros(1,21)];
avgmat2 = convn(mat,ker2,'valid');
The question:
The results don't quite match and I'm wondering if I have the dimensions incorrect here for the kernel. Any guidance is greatly appreciated.
Judging from the context of your question, you have a 3D matrix and you want to find the moving average of each row independently over all 3D slices. The code above should work (the first case). However, the valid flag returns a matrix whose size is valid in terms of the boundaries of the convolution. Take a look at the first point of the post that you linked to for more details.
Specifically, the first 20 entries of each row will be missing due to the valid flag (a length-21 kernel shortens each row by 21 - 1 = 20 entries). It's only when you get to the 21st entry of each row that the convolution kernel becomes completely contained inside a row of the matrix, and it's from that point that you get valid results (no pun intended). If you'd like to see the entries at the boundaries, then you'll need to use the 'same' flag if you want to maintain the same size matrix as the input, or the 'full' flag (the default), which gives you the output starting from the most extreme outer edges; but bear in mind that near the edges the moving average is computed against padded zeros, so those first entries wouldn't be what you expect anyway.
However, if I'm interpreting what you are asking correctly, then the valid flag is what you want; just bear in mind that each row of the output will be 20 entries shorter to accommodate the edge cases. All in all, your code should work, but be careful about how you interpret the results.
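As an aside (my own check, not part of the original answer), a quick look at the output sizes makes the effect of the flags visible, and also shows why avgmat and avgmat2 cannot match exactly: ker2 is 3-by-21, so 'valid' additionally trims one row from each side:
ker  = ones(1,21) ./ 21;
ker2 = [zeros(1,21); ker; zeros(1,21)];
mat  = randn(150, 3650, 4);
size(convn(mat, ker,  'valid'))   % [150 3630 4] : 3650 - 21 + 1 columns
size(convn(mat, ker2, 'valid'))   % [148 3630 4] : the 3-row kernel trims rows too
size(convn(mat, ker,  'same'))    % [150 3650 4] : edge entries use zero padding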
BTW, you have a symmetric kernel, and so flipping should have no effect on the convolution output. What you have specified is a standard moving-average kernel, so convolution should find the moving average exactly as you expect.
Good luck!

Clustering in Matlab

Hi, I am trying to cluster using linkage(). Here is the code I am trying:
Y = pdist(data);
Z = linkage(Y);
T = cluster(Z,'maxclust',4096);
I am getting the following error:
The number of elements exceeds the maximum allowed size in
MATLAB.
Error in ==> linkage at 135
Z = linkagemex(Y,method);
The data size is 56710-by-128. How can I apply the code to small chunks of the data and then merge those clusters optimally? Or is there any other solution to the problem?
MATLAB probably cannot cluster this many objects with this algorithm.
Most likely the implementation uses a distance matrix. A pairwise distance matrix for 56710 objects needs 56710*56709/2 = 1,607,983,695 entries, or some 12 GB of RAM; most likely a working copy of it is needed as well. Chances are that the default MATLAB data structures are not prepared to handle this amount of data (and you wouldn't want to wait for the algorithm to finish either; that is probably why only a certain amount is "allowed").
Try using a subset, and see how well it scales. If you use 1000 instances, does it work? How long does the computation take? If you increase to 2000, how much longer does it take?
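A rough sketch of that scaling experiment (illustrative only; the data here is a random stand-in for the real 56710-by-128 matrix):
data = randn(56710, 128);                 % stand-in for the real data
for n = [1000 2000 4000 8000]
    idx = randperm(size(data, 1), n);     % random subset of n rows
    tic;
    Y = pdist(data(idx, :));
    Z = linkage(Y);
    T = cluster(Z, 'maxclust', min(4096, n));
    fprintf('n = %5d: %.1f s\n', n, toc);
end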

Clustering data using matlab

I'm trying to cluster my data. This is an example of my data:
genes param1 param2 ...
gene1 0.224 -0.113 ...
gene2 -0.149 -0.934 ...
I have a thousand genes and a hundred parameters. I wanted to cluster my data by both genes and parameters, and used clustergram for it. As there are a lot of genes, it is very difficult to understand anything from a picture. Now I want text information about the 15-20 biggest clusters of genes in my data, i.e. 15-20 lists of genes that belong to the different clusters. How can I do this?
Thanks
This is an example of the clustergram I get from my data:
There are vertical and horizontal dendrograms here. As there are a lot of rows, it is impossible to see anything on the vertical dendrogram (which is the only one I need).
As far as I understand, the dendrogram creates binary clusters from my data, so there are N-1 clusters from N rows of data. As these are binary clusters, there is one cluster at the top; at the next step it splits into two, then each of those splits into two again, and so on. Can I get information about which genes are in which clusters at the 4th step, for example, when there are 16 clusters?
To see interesting parts of the dendrogram and heatmap more clearly, you can use the zoom button on the toolbar to select regions of interest and zoom in on them.
To find out which genes/variables are in a particular cluster, right-click on a point in one of the dendrograms that represents the cluster you're interested in, and select Export to Workspace. You'll get a structure with the following fields:
GroupNames — Cell array of text strings containing the names of the row or column groups.
RowNodeNames — Cell array of text strings containing the names of the row nodes.
ColumnNodeNames — Cell array of text strings containing the names of the column nodes.
ExprValues — An M-by-N matrix of intensity values, where M and N are the number of row nodes and of column nodes respectively.
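If you would rather get the 16 groups programmatically instead of through the clustergram GUI, one option is to redo the hierarchical clustering with linkage and cut it with cluster (my sketch; data is assumed to be the genes-by-parameters matrix and geneNames a matching cell array of labels, and clustergram's default distance/linkage/standardization settings may differ, so the grouping will not necessarily match the picture):
Z = linkage(pdist(data), 'average');
T = cluster(Z, 'maxclust', 16);          % cluster index for every gene
for k = 1:16
    fprintf('Cluster %d (%d genes):\n', k, nnz(T == k));
    disp(geneNames(T == k));
end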