Clustering data using matlab - matlab

I'm trying to cluster my data. This is the example of my data:
genes param1 param2 ...
gene1 0.224 -0.113 ...
gene2 -0.149 -0.934 ...
I have about a thousand genes and a hundred parameters. I wanted to cluster my data by both genes and parameters, and used clustergram for it. As there are a lot of genes, it's very difficult to understand anything from the picture. Now I want text information about the 15-20 biggest clusters of genes in my data, i.e. 15-20 lists of genes that belong to different clusters. How can I do this?
Thanks
This is the example of clustergram I have from my data:
There are vertical and horizontal dendrograms here. As there are a lot of rows, it's impossible to see anything on the vertical dendrogram (I need only this one).
As far as I understand, a dendrogram creates binary clusters from my data, and there are N-1 clusters from N rows of data. As these are binary clusters, there is one cluster at the top, which splits into two on the next step, then each of those splits into two again, and so on. Can I get information about which genes are in which clusters on the 4th step, for example, when there are 16 clusters?

To see interesting parts of the dendrogram and heatmap more clearly, you can use the zoom button on the toolbar to select regions of interest and zoom in on them.
To find out which genes/variables are in a particular cluster, right-click on a point in one of the dendrograms that represents the cluster you're interested in, and select Export to Workspace. You'll get a structure with the following fields:
GroupNames — Cell array of text strings containing the names of the row or column groups.
RowNodeNames — Cell array of text strings containing the names of the row nodes.
ColumnNodeNames — Cell array of text strings containing the names of the column nodes.
ExprValues — An M-by-N matrix of intensity values, where M and N are the number of row nodes and of column nodes respectively.
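If you want the gene lists as text rather than by clicking, the idea is to cut the tree so that exactly k clusters remain and then print the members of each. In MATLAB this is probably what `linkage` followed by `cluster(Z,'maxclust',k)` from the Statistics Toolbox gives you, though the result may not match the clustergram unless the same distance and linkage settings are used. Below is a minimal Python sketch of the same idea, not the clustergram's own implementation: the function name `cut_into_k_clusters`, the choice of average linkage with Euclidean distance, and the toy data are all assumptions made for illustration. It is a naive implementation meant to show the logic, and would be slow on a thousand genes.

```python
from itertools import combinations
from math import dist


def cut_into_k_clusters(names, rows, k):
    """Merge the two closest clusters (average linkage) until k remain.

    Returns k lists of names, largest cluster first.
    """
    clusters = [[i] for i in range(len(rows))]

    def avg_distance(a, b):
        # Mean Euclidean distance over all pairs drawn from the two clusters.
        total = sum(dist(rows[i], rows[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # combinations() yields index pairs (a, b) with a < b.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg_distance(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    clusters.sort(key=len, reverse=True)
    return [[names[i] for i in sorted(c)] for c in clusters]


# Made-up values, two parameters per gene.
names = ["gene1", "gene2", "gene3", "gene4", "gene5"]
rows = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]

gene_lists = cut_into_k_clusters(names, rows, 3)
for number, members in enumerate(gene_lists, start=1):
    print(f"cluster {number}: {', '.join(members)}")
```

Asking for k = 16 would correspond to the "16 clusters" level in the question, regardless of how balanced the tree is.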

Related

Boxplot is broken, only showing one line

My data centres on different treatments and how they impact the day of germination. (Image: dodgy boxplot data)
A while ago whilst making violin plots in R to show the distribution of when germination occurs according to treatment, I attempted to add a boxplot as a descriptive statistic and was met with only one line.
I contacted many people who simply had no idea what the issue was. I used this same data in another violin plot as part of a bigger data collection with more treatments, including this one.
I moved on from this, though I found it odd. Now that I have come to perform stats tests in SPSS, I have the same problem, as shown in the image below. When I try a Mann-Whitney U test I am told "cannot compute" because there are not exactly two variables; when I try a Kruskal-Wallis test I get the dodgy boxplot below and am told pairwise comparisons cannot be done because there are fewer than 3 test fields (i.e. 2).
I am at an absolute loss. I have tried rewriting the data out and copying data labels with 'stratified', 'strat', 's' etc., and I have no idea where the problem could lie. If anyone could give me any guidance, it would be really appreciated!
Thank you
The dependent variable in question appears to have only values 1, 2, and 3 in the Stratified group. If there is at least one case with a value of 1, at least one case with a value of 3, but most values at 2, then a box plot like you're seeing would be expected. In SPSS, run the EXAMINE procedure (Analyze>Descriptive Statistics>Explore in the menus), specifying the same dependent variable and grouping variable, and asking for percentiles. The box plots should match what you're getting, and in the percentiles table you should see that Tukey's hinges show the same value of 2 for the 25th, 50th, and 75th percentiles.
Tukey's hinges are the basis for the box and the line in box plots. The line is at the median or 50th percentile, and the lower and upper box edges are at the 25th and 75th percentiles, respectively. When all three coincide, you get just a line instead of a box.
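To see why the box collapses, here is a small Python sketch that computes Tukey's hinges by hand. The data are hypothetical (a group where almost every value is 2, as described above), and `tukey_hinges` is a helper written for this illustration, not an SPSS function.

```python
from statistics import median


def tukey_hinges(values):
    """Return (lower hinge, median, upper hinge).

    Each hinge is the median of one half of the sorted data; with an odd
    number of values the overall median is included in both halves.
    """
    data = sorted(values)
    half = (len(data) + 1) // 2
    return median(data[:half]), median(data), median(data[-half:])


# Hypothetical germination days: mostly 2, with a single 1 and a single 3.
stratified = [1, 2, 2, 2, 2, 2, 2, 2, 2, 3]

lower, mid, upper = tukey_hinges(stratified)
box_length = upper - lower
print(lower, mid, upper, box_length)
```

Both hinges and the median land on 2, so the box length is 0 and the box is drawn as a single line.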
There are two types of outlying values identified in box plots in SPSS. Points between 1.5 and 3 box lengths below or above the box edges are outliers, marked with circles, and points more than 3 box lengths below or above the box edges are extremes, marked with asterisks. Since the box length here is 0, anything at other values is automatically an extreme.
Pairwise comparisons following a Kruskal-Wallis test are available only when there are at least three groups, since with only two groups the overall or omnibus test has already compared the two groups. I'm not sure what the issue was when trying to run a Mann-Whitney test.

Tableau Dual Axis with different filters

I am trying to create a graph with two lines, with two filters from the same dimension.
I have a dimension which has 20+ values. I'd like one line to show data based on just one of the selected values and the other line to show a line excluding that same value.
I've tried the following:
-Creating a duplicate/copy dimension and filtering the original one with the first, and the copy with the 2nd. When I do this, the graphic disappears.
-Creating a calculated field that tries to split the measures up. This isn't letting me track the count.
I want this on the same axis; the best I've been able to do is create two sheets, one with the first filter and one with the 2nd, and stack them in a dashboard.
My end user wants the lines in the same visual, otherwise I'd be happy with the dashboard approach. Right now, though, I'd also like to know how to do this.
It is a little hard to tell exactly what you want to achieve, but the problem with filtering is common.
The principle that is important is that Tableau will filter the whole dataset by row. So duplicating the dimension you want to filter won't help as the filter on the original dimension will also filter the corresponding rows in the second dimension. Any solution has to be clever enough to work around this issue.
One solution is to build two new dimensions that use a calculation rather than a filter to create the new result. Let's say you have a dimension, [size] that has a range of numbers from 1 to 10 and you want to compare the total number of rows including and excluding the number 5. You could create a new field using a formula like if [size] <> 5 then 1 else 0 end
Summing the new field will give a count of the number of rows that don't contain a 5 and this can be compared directly to a rowcount of the original [size] field which will give the number including the value 5.
This basic principle can be extended to much more complex logic. The essential point is to realise that filters act on every row in your data and can't, by themselves, show comparisons with alternative filter choices on a single visualisation.
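The difference between filtering and flagging can be sketched outside Tableau. The Python below is only an analogy for the calculated field described above, using the same [size] example; the `month` field and the sample rows are made up to stand in for whatever is on the x-axis.

```python
from collections import defaultdict

# Made-up rows; "month" stands in for the axis the lines are plotted against.
rows = [
    {"month": "Jan", "size": 5},
    {"month": "Jan", "size": 3},
    {"month": "Jan", "size": 7},
    {"month": "Feb", "size": 5},
    {"month": "Feb", "size": 5},
    {"month": "Feb", "size": 1},
]

all_rows = defaultdict(int)      # first line: row count including size 5
without_five = defaultdict(int)  # second line: sum of the flag field

for row in rows:
    # Equivalent of: if [size] <> 5 then 1 else 0 end
    flag = 1 if row["size"] != 5 else 0
    all_rows[row["month"]] += 1
    without_five[row["month"]] += flag

for month in all_rows:
    print(month, all_rows[month], without_five[month])
```

No row is ever removed, so both measures are available for every month and can share one axis.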
Depending on the nature of your problem there may be other solutions worth looking at including sets and groups but you would need to provide more specific details for users here to tell you whether they would be useful.
We can make a set out of the values of the dimension and then place it on the required shelf. You will then have your dimension, which will plot as usual, and the set, which will hold the data you need. A filter alone can't give you that independence, because it applies to every row each time.

Single-linkage hierarchical cluster method: cutting the tree

I have a dataset containing 3 categories {c1, c2 and c3}. I'm using the single-linkage hierarchical cluster method (from MATLAB) to cluster the dataset. I built my own distance measure. The following figure shows the results. Note that the hierarchical cluster method clusters the data correctly: the points of c1 (yellow) are very close to each other, and similarly for c2 (green) and c3 (blue).
From the figure, we can see that the distances between the points in c1 are very small compared to c2 and c3. So, for example, if I decide to cut the tree at 8, c1, c2 and c3 will be split into 8 clusters, with each point in a different cluster.
How can I overcome this problem? Do I need to change the clustering method, or cut the tree at 17 and cluster the resulting clusters again?
There are different ways of extracting clusters from a dendrogram. You are not required to do a single cut (although MATLAB may only offer this choice). Selecting regions like you did is also reasonable, and so is cutting the dendrogram at multiple heights. But not every tool has all of these capabilities.
Notice that c3 was split into two, half of which is not well separated from c2.

buffer of clusters in a sparse matrix

I work with MATLAB.
I have a sparse matrix where I identified different clusters. The value within each cluster is equal, while each cluster has its own unique value. I have 0s as background (outside clusters). Here is an example with clusters 1 and 2:
A = [0 0 0 0 0 2 0 0 2 0 0 0
     1 1 0 0 0 2 2 2 2 0 0 0
     1 1 1 0 0 0 2 2 2 2 0 0
     1 1 0 0 0 0 0 2 2 0 0 0
     1 1 1 0 0 0 0 0 0 0 0 0]
I'd like to use each cluster as "a polygon" and study the value of the outside neighbor pixels (a sort of buffer as in vector data). Obviously in the example it would output 0 as mean all the time, but the point is understanding how to do it, as I have to apply this to another matrix (I work with geolocated data, so I would use the buffer area to find mean values in specific rasters). Is there a way to do that? Also, if so, can I specify the width of this buffer (as number of pixels)?
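One possible way to think about this is as a dilation of each cluster's mask: grow the cluster by the buffer width, then drop the cluster's own cells, and what is left is the buffer ring. In MATLAB that would probably map to something like `imdilate` from the Image Processing Toolbox applied to `A == k`. The Python below is only a sketch of the idea under a few assumptions: the names `buffer_cells` and `buffer_mean` are made up, the neighbourhood is a square window (so diagonal neighbours count), and cells belonging to other clusters are kept in the buffer if they fall inside it.

```python
def buffer_cells(grid, label, width=1):
    """Cells within `width` pixels of cluster `label`, excluding the cluster."""
    n_rows, n_cols = len(grid), len(grid[0])
    ring = set()
    for r in range(n_rows):
        for c in range(n_cols):
            if grid[r][c] != label:
                continue
            # Square window around each cluster cell, clipped to the grid.
            for rr in range(max(0, r - width), min(n_rows, r + width + 1)):
                for cc in range(max(0, c - width), min(n_cols, c + width + 1)):
                    if grid[rr][cc] != label:
                        ring.add((rr, cc))
    return ring


def buffer_mean(grid, values, label, width=1):
    """Mean of `values` over the buffer of `label`; None if the buffer is empty."""
    cells = buffer_cells(grid, label, width)
    if not cells:
        return None
    return sum(values[r][c] for r, c in cells) / len(cells)


# The example matrix from the question.
A = [
    [0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0],
    [1, 1, 0, 0, 0, 2, 2, 2, 2, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 2, 2, 2, 2, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]

ring_of_1 = buffer_cells(A, 1, width=1)
mean_around_1 = buffer_mean(A, A, 1, width=1)
print(len(ring_of_1), mean_around_1)
```

Passing a second raster of the same shape as `values` gives the mean of that raster over the buffer, and `width` sets the buffer size in pixels.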

How can I find the minimum number of lines needed to cover all the zeros in a 2 dimensional array?

I'm trying to make a decent implementation of the Hungarian algorithm, but I'm stuck on how to find the minimum number of lines that cover all the zeros in an array.
I also need to know which lines these are, to make some computations later.
here is the explanation:
http://www.ams.jhu.edu/~castello/362/Handouts/hungarian.pdf
in step 3 it says
Use as few lines as possible to cover all the zeros in the matrix. There is no easy rule to do this - basically trial and error.
What does trial and error mean in terms of computation? If I have, for example, a 2D array of 5 rows and 5 columns, then
the first row alone could cover all the zeros, or the first and second rows, or the first row and first column, and so on: too many combinations.
Isn't there something more efficient than this?
thanks in advance
You need to implement a bipartite matching algorithm here. You have two partitions in the graph: the vertices in one of them represent the rows and the vertices in the other represent the columns of the table. There is an edge between row_i and column_j iff there is a 0 in cell (i,j). Then you compute a maximum bipartite matching. After the last iteration of the bipartite matching algorithm you need to figure out which vertices are connected to the source of your matching via an augmenting path. An augmenting path here is a path using only edges with positive residual capacity. You should have a picture like:
             row_1       col_1
           /                   \
          /-- row_2       col_2 --\
source --     ....  some_edges     -- sink
          \                       /
           \-- row_n      col_n -/
                          ....  /
                          col_m
After you find the maximum bipartite matching, find which rows and columns are reachable from the source via positive-capacity edges. The minimum set of rows and columns you need to cross out is then found as follows: cross out all the rows that are not reachable from the source and all the columns that are reachable, and this is your optimal solution.
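Here is a minimal Python sketch of that procedure, under some assumptions: the function name `min_line_cover` is made up, the matching is found with simple augmenting-path search rather than an explicit flow network, and "reachable from the source" is implemented as an alternating search starting from the unmatched rows (which is what reachability in the residual graph amounts to).

```python
def min_line_cover(matrix):
    """Return (rows, cols): a minimum set of lines covering every zero."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Edge between row r and column c wherever the cell is zero.
    zeros = [[c for c in range(n_cols) if matrix[r][c] == 0]
             for r in range(n_rows)]
    match_col = [-1] * n_cols  # match_col[c] = row matched to column c

    def try_augment(r, seen):
        # Depth-first search for an augmenting path starting at row r.
        for c in zeros[r]:
            if c in seen:
                continue
            seen.add(c)
            if match_col[c] == -1 or try_augment(match_col[c], seen):
                match_col[c] = r
                return True
        return False

    for r in range(n_rows):
        try_augment(r, set())

    matched_rows = {r for r in match_col if r != -1}
    # Alternating search: row -> column along any zero,
    # column -> row only along its matched edge.
    reach_rows = {r for r in range(n_rows) if r not in matched_rows}
    reach_cols = set()
    stack = list(reach_rows)
    while stack:
        r = stack.pop()
        for c in zeros[r]:
            if c in reach_cols:
                continue
            reach_cols.add(c)
            partner = match_col[c]
            if partner != -1 and partner not in reach_rows:
                reach_rows.add(partner)
                stack.append(partner)

    cover_rows = sorted(set(range(n_rows)) - reach_rows)  # unreachable rows
    cover_cols = sorted(reach_cols)                       # reachable columns
    return cover_rows, cover_cols


example = [
    [0, 0, 0],
    [0, 5, 7],
    [0, 3, 9],
]
lines = min_line_cover(example)
print(lines)
```

For this example the zeros are covered by row 0 plus column 0, so two lines suffice. The recursion in `try_augment` is fine for small matrices; a large one might need an iterative version.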
Hope this answer helps you.
I'm not quite sure why they told you to do trial and error. The Hungarian algorithm, however, does not take exponential time. Take a look at wikipedia, which walks you through an example of how to figure out the minimum number of lines (look at Step 3):
http://en.wikipedia.org/wiki/Hungarian_algorithm#Matrix_interpretation
The article also includes links to implementations, and some online course notes which give more complete (although also more technical) descriptions of the Hungarian algorithm than the one you're using.
Trial and error means O((n+m)!) complexity.
At most you will only need to pick min(n,m) lines, as selecting all rows or columns will cover all 0s.
I would implement a dynamic programming algorithm to solve this step for large problems.