Different clusters with the same method - cluster-analysis

I am stuck on a problem with hierarchical clustering. I want to make a dendrogram and a heatmap, using correlation as the distance (d_mydata = dist(1 - cor(t(mydata)))) and ward.D2 as the clustering method.
As an extra, the pheatmap package can plot the dendrogram on the left side so you can visualize the clusters.
The pipeline of my analysis would be this:
create the dendrogram
test how many clusters would be optimal (k)
extract the subjects in each cluster
create a heatmap
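In code, the pipeline would look roughly like this (only a sketch with a placeholder mydata matrix and an arbitrary k; my real code on a test matrix follows below):
d_mydata <- dist(1 - cor(t(mydata)))          # correlation-based distance
hc <- hclust(d_mydata, method = "ward.D2")    # Ward clustering
plot(as.dendrogram(hc))                       # 1. dendrogram
k <- 2                                        # 2. pick k, e.g. by inspecting the tree
clus <- cutree(hc, k = k)                     # 3. subjects in each cluster
pheatmap::pheatmap(mydata,                    # 4. heatmap with the same settings
                   clustering_method = "ward.D2",
                   clustering_distance_rows = "correlation")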
To my surprise, the dendrogram plotted in the heatmap is not the same as the one plotted before, even though the methods are the same.
So I decided to create a pheatmap coloured by the clusters obtained earlier with cutree, to test whether the colours correspond to the clusters in the dendrogram.
This is my code:
# Create test matrix
test = matrix(rnorm(200), 20, 10)
test[1:10, seq(1, 10, 2)] = test[1:10, seq(1, 10, 2)] + 3
test[11:20, seq(2, 10, 2)] = test[11:20, seq(2, 10, 2)] + 2
test[15:20, seq(2, 10, 2)] = test[15:20, seq(2, 10, 2)] + 4
colnames(test) = paste("Test", 1:10, sep = "")
rownames(test) = paste("Gene", 1:20, sep = "")
test<-as.data.frame(test)
# Create a dendrogram with this test matrix
dist_test<-dist(test)
hc=hclust(dist_test, method="ward.D2")
plot(hc)
dend<-as.dendrogram(hc, check=F, nodePar=list(cex = .000007),leaflab="none", cex.main=3, axes=F, adjust=F)
clus2 <- as.factor(cutree(hc, k=2)) # cut tree into 2 clusters
groups<-data.frame(clus2)
groups$id<-rownames(groups)
#-----------DATAFRAME WITH mydata AND THE CLASSIFICATION OF CLUSTERS AS FACTORS---------------------
test$id<-rownames(test)
clusters<-merge(groups, test, by.x="id")
rownames(clusters)<-clusters$id
clusters$clus2<-as.character(clusters$clus2)
clusters$clus2[clusters$clus2== "1"]= "cluster1"
clusters$clus2[clusters$clus2=="2"]<-"cluster2"
plot(dend,
     main = "test",
     horiz = TRUE, leaflab = "none")
d_clusters<-dist(1-cor(t(clusters[,7:10])))
hc_cl=hclust(d_clusters, method="ward.D2")
annotation_col = data.frame(
  Path = factor(colnames(clusters[3:12]))
)
rownames(annotation_col) = colnames(clusters[3:12])
annotation_row = data.frame(
  Group = factor(clusters$clus2)
)
rownames(annotation_row) = rownames(clusters)
# Specify colors
ann_colors = list(
  Path = c(Test1 = "darkseagreen", Test2 = "lavenderblush2", Test3 = "lightcyan3", Test4 = "mediumpurple", Test5 = "red",
           Test6 = "blue", Test7 = "brown", Test8 = "pink", Test9 = "black", Test10 = "grey"),
  Group = c(cluster1 = "yellow", cluster2 = "blue")
)
library(RColorBrewer)
cols <- colorRampPalette(brewer.pal(10, "RdYlBu"))(20)
library(pheatmap)
pheatmap(clusters[, 3:12], color = rev(cols),
         scale = "column",
         kmeans_k = NA,
         show_rownames = F, show_colnames = T,
         main = "Heatmap CK14, CK5/6, GATA3 and FOXA1 n=492 SCALE",
         clustering_method = "ward.D2",
         cluster_rows = TRUE, cluster_cols = TRUE,
         clustering_distance_rows = "correlation",
         clustering_distance_cols = "correlation",
         annotation_row = annotation_row,
         annotation_col = annotation_col,
         annotation_colors = ann_colors)
Has anyone had the same issue? Am I making a stupid mistake?
Thank you in advance


summary row with gtsummary

I am trying to create a table of events with gtsummary, and I would like to obtain a final row counting the events of the previous rows. add_overall() and add_n() do add a total, but as a column: they count the same event across groups, not the overall number of events.
I created this example.
library(gtsummary)
library(forcats)  # assuming as_factor() comes from forcats here
x1 <- sample(c("No", "Yes"), 30, replace = TRUE, prob = c(0.85, 0.15))
x2 <- sample(c("No", "Yes"), 30, replace = TRUE, prob = c(0.9, 0.1))
x3 <- sample(c("No", "Yes"), 30, replace = TRUE, prob = c(0.75, 0.25))
y <- sample(c("A", "B"), 30, replace = TRUE, prob = c(0.5, 0.5))
df <- data.frame(as_factor(x1), as_factor(x2), as_factor(x3), as_factor(y))
colnames(df) <- c("event_1", "event_2", "event_3", "group")
tbl_summary(df, by = group, statistic = all_categorical() ~ "{n}")
I tried using the summary_rows() function from the gt package after converting the table to a gt object, but there is an error when summarising because these variables are factors.
Any other ideas?
You can do this by adding a new variable to your data frame that is the row sum of each of the events. Then you can display that variable's sum in the summary table. Example below!
library(gtsummary)
library(tidyverse)
df <-
  data.frame(
    event_1 = sample(c(FALSE, TRUE), 30, replace = TRUE, prob = c(0.85, 0.15)),
    event_2 = sample(c(FALSE, TRUE), 30, replace = TRUE, prob = c(0.9, 0.1)),
    event_3 = sample(c(FALSE, TRUE), 30, replace = TRUE, prob = c(0.75, 0.25)),
    group = sample(c("A", "B"), 30, replace = TRUE, prob = c(0.5, 0.5))
  ) |>
  rowwise() |>
  mutate(Total = sum(event_1, event_2, event_3))

tbl_summary(
  df,
  by = group,
  type = Total ~ "continuous",
  statistic =
    list(all_categorical() ~ "{n}",
         all_continuous() ~ "{sum}")
) |>
  as_kable() # convert to kable to display on stack overflow
Characteristic   A, N = 16   B, N = 14
event_1          4           4
event_2          1           2
event_3          7           6
Total            12          12
Created on 2023-01-12 with reprex v2.0.2
Thank you so much (great package, gtsummary). That works! I had some trouble summing over factors. If the variables are factors, the code
mutate(Total = sum(event_1 == "Yes", event_2 == "Yes", event_3 == "Yes"))
does it.

Multilevel data simulation using simglm: how do I simulate random effects at level 1

I want to add random effects at level 1. Below is the working code with level 2 simulated.
I have two questions I hope folks can help with:
How do I get a reasonable estimate of the level 2 variance (assuming I have sample data)? Can I just square the between-person SD of the DV?
How do I simulate level 1 variance, and how do I determine a reasonable value at level 1?
I've tried:
randomeffect = list(int_neighborhood = list(variance = 8, var_level = 2),
                    weight = list(variance = 8, var_level = 1))
but that throws an error.
This code works without level 1:
library(simglm)
library(nlme)
library(dplyr)

ctrl <- lmeControl(opt = 'optim')
sim_arguments <- list(
  formula = y ~ 1 + weight + age + sex + (1 | neighborhood),
  reg_weights = c(4, -0.03, 0.2, 0.33),
  fixed = list(weight = list(var_type = 'continuous', mean = 180, sd = 30),
               age = list(var_type = 'ordinal', levels = 30:60),
               sex = list(var_type = 'factor', levels = c('male', 'female'))),
  randomeffect = list(int_neighborhood = list(variance = 8, var_level = 2)),
  sample_size = list(level1 = 62, level2 = 60)
)
nested_data <- sim_arguments %>%
  simulate_fixed(data = NULL, .) %>%
  simulate_randomeffect(sim_arguments) %>%
  simulate_error(sim_arguments) %>%
  generate_response(sim_arguments)

RandomIntercept <- lme(fixed = y ~ 1 + weight + age + sex,
                       random = ~ 1 | neighborhood,
                       correlation = corAR1(),
                       data = nested_data,
                       control = ctrl,
                       na.action = na.exclude)
summary(RandomIntercept)

RandomSlope <- lme(fixed = y ~ 1 + weight + age + sex,
                   random = ~ 1 + weight | neighborhood,
                   correlation = corAR1(),
                   data = nested_data,
                   control = ctrl,
                   na.action = na.exclude)
summary(RandomSlope)

anova(RandomIntercept, RandomSlope)
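For the level-1 part of the question, here is a hedged sketch based on my reading of the simglm documentation (not verified against this exact setup): the residual (level-1) variance is specified through the error element of sim_arguments rather than through randomeffect, which may be why the weight entry with var_level = 1 errors out. Reusing the packages loaded above:
sim_arguments_l1 <- list(
  formula = y ~ 1 + weight + age + sex + (1 | neighborhood),
  reg_weights = c(4, -0.03, 0.2, 0.33),
  fixed = list(weight = list(var_type = 'continuous', mean = 180, sd = 30),
               age = list(var_type = 'ordinal', levels = 30:60),
               sex = list(var_type = 'factor', levels = c('male', 'female'))),
  randomeffect = list(int_neighborhood = list(variance = 8, var_level = 2)),
  error = list(variance = 10),   # level-1 (residual) variance; the value 10 is an arbitrary placeholder
  sample_size = list(level1 = 62, level2 = 60)
)

nested_data_l1 <- sim_arguments_l1 %>%
  simulate_fixed(data = NULL, .) %>%
  simulate_randomeffect(sim_arguments_l1) %>%
  simulate_error(sim_arguments_l1) %>%
  generate_response(sim_arguments_l1)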

Machine Translation FFN: Dimension problem due to window size

This is my first time creating an FFN, and I'm training it to translate French to English using word prediction:
The inputs are two arrays: one of size 2 x window_size + 1 from the source language and one of size window_size from the target language. The label has size 1.
For example, for window_size = 2:
["je","mange", "la", "pomme","avec"]
and
["I", "eat"]
So the inputs have sizes [5] and [2], i.e. 7 after concatenation.
Label: "the" (referring to "la" in French)
The label is converted to one-hot encoding before comparing it with yHat.
I'm using a unique index for each word (1 to len(vocab)) and train on the indices (not the words).
The output of the FFN is a probability distribution over the target-language vocabulary.
The problem is that the FFN doesn't learn and the accuracy stays at 0.
When I print the sizes of y_final (the target probabilities) and yHat (the model output), they have different dimensions:
yHat.size() = [512, 7, 10212]
with batch_size 512, 7 the concatenated input size and 10212 the size of the target vocab, while
y_final.size() = [512, 10212]
And throughout the forward method the sizes are:
torch.Size([512, 5, 32])
torch.Size([512, 5, 64])
torch.Size([512, 5, 64])
torch.Size([512, 2, 256])
torch.Size([512, 2, 32])
torch.Size([512, 2, 64])
torch.Size([512, 2, 64])
torch.Size([512, 7, 64])
torch.Size([512, 7, 128])
torch.Size([512, 7, 10212])
Since the accuracy only increases when yHat equals y_final, I suspect this never happens because they don't even have the same shape (2D vs 3D). Is this the problem?
Please refer to the code, and if you need any other info please tell me.
The code runs without errors.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader
from tqdm import tqdm

trainingData = TensorDataset(encoded_source_windows, encoded_target_windows, encoded_labels)
# print(trainingData)
batchsize = 512
trainingLoader = DataLoader(trainingData, batch_size=batchsize, drop_last=True)


def ffnModel(vocabSize1, vocabSize2, learningRate=0.01):
    class ffNetwork(nn.Module):
        def __init__(self):
            super().__init__()
            self.embeds_src = nn.Embedding(vocabSize1, 256)
            self.embeds_target = nn.Embedding(vocabSize2, 256)
            # input layer
            self.inputSource = nn.Linear(256, 32)
            self.inputTarget = nn.Linear(256, 32)
            # hidden layer 1
            self.fc1 = nn.Linear(32, 64)
            self.bnormS = nn.BatchNorm1d(5)
            self.bnormT = nn.BatchNorm1d(2)
            # Layer(s) after concatenation:
            self.fc2 = nn.Linear(64, 128)
            self.output = nn.Linear(128, vocabSize2)
            self.softmaaax = nn.Softmax(dim=0)

        # forward pass
        def forward(self, xSource, xTarget):
            xSource = self.embeds_src(xSource)
            xSource = F.relu(self.inputSource(xSource))
            xSource = F.relu(self.fc1(xSource))
            xSource = self.bnormS(xSource)
            xTarget = self.embeds_target(xTarget)
            xTarget = F.relu(self.inputTarget(xTarget))
            xTarget = F.relu(self.fc1(xTarget))
            xTarget = self.bnormT(xTarget)
            xCat = torch.cat((xSource, xTarget), dim=1)  # dim=128 or 1 ?
            xCat = F.relu(self.fc2(xCat))
            print(xCat.size())
            xCat = self.softmaaax(self.output(xCat))
            return xCat

    # creating instance of the class
    net = ffNetwork()
    # loss function
    lossfun = nn.CrossEntropyLoss()
    # lossfun = nn.NLLLoss()
    optimizer = torch.optim.Adam(net.parameters(), lr=learningRate)
    return net, lossfun, optimizer


def trainModel(vocabSize1, vocabSize2, learningRate):
    # number of epochs
    numepochs = 64
    # create a new Model instance
    net, lossfun, optimizer = ffnModel(vocabSize1, vocabSize2, learningRate)
    # initialize losses
    losses = torch.zeros(numepochs)
    trainAcc = []
    # loop over training data batches
    batchAcc = []
    batchLoss = []
    for epochi in range(numepochs):
        # Switching on training mode
        net.train()
        # loop over training data batches
        batchAcc = []
        batchLoss = []
        for A, B, y in tqdm(trainingLoader):
            # forward pass and loss
            final_y = []
            for i in range(y.size(dim=0)):
                yy = [0] * target_vocab_length
                yy[y[i]] = 1
                final_y.append(yy)
            final_y = torch.tensor(final_y)
            yHat = net(A, B)
            loss = lossfun(yHat, final_y)
            ################
            print("\n yHat.size()")
            print(yHat.size())
            print("final_y.size()")
            print(final_y.size())
            # backprop
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # loss from this batch
            batchLoss.append(loss.item())
            print(f'batchLoss: {loss.item()}')
            # Accuracy calculator:
            matches = torch.argmax(yHat) == final_y  # booleans (false/true)
            matchesNumeric = matches.float()  # convert to numbers (0/1)
            accuracyPct = 100 * torch.mean(matchesNumeric)  # average and x100
            batchAcc.append(accuracyPct)  # add to list of accuracies
            print(f'accuracyPct: {accuracyPct}')
        trainAcc.append(np.mean(batchAcc))
        losses[epochi] = np.mean(batchLoss)
    return trainAcc, losses, net


trainAcc, losses, net = trainModel(len(source_vocab), len(target_vocab), 0.01)
print(trainAcc)

"IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices." Pls slove this

I'm trying to cluster a dataset using a hierarchical clustering algorithm, but I got an error in the last step.
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc

plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(df_nor, method="ward"))
plt.axhline(y=6, color='black', linestyle='--')
(dendrogram plot)
from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=2,affinity='euclidean', linkage='ward')
cluster.fit_predict(df_nor)
The array
a = (df_nor['Milk'])
b = (df_nor['Grocery'])
plt.figure(figsize=(10, 7))
plt.scatter(a,b, c=cluster.labels_)
(screenshot of the error)

Exclude items from training set data

I have my data in two tables: colors and excluded_colors.
colors contains all colors.
excluded_colors contains some colors that I wish to exclude from my training set.
I am trying to split the data into a training and testing set and ensure that the colors in excluded_colors are not in my training set but do exist in the testing set.
In order to achieve the above, I did this:
var colors = spark.sql("""
select colors.*
from colors
LEFT JOIN excluded_colors
ON excluded_colors.color_id = colors.color_id
where excluded_colors.color_id IS NULL
"""
)
val trainer: (Int => Int) = (arg: Int) => 0
val sqlTrainer = udf(trainer)
val tester: (Int => Int) = (arg: Int) => 1
val sqlTester = udf(tester)
val splits = colors.randomSplit(Array(0.7, 0.3))
val train_colors = splits(0).select("color_id").withColumn("test", sqlTrainer(col("color_id")))
val test_colors = splits(1).select("color_id").withColumn("test", sqlTester(col("color_id")))
However, I'm realizing that by doing the above, the colors in excluded_colors are completely ignored. They are not even in my testing set.
Question
How can I split the data 70/30 while also ensuring that the colors in excluded_colors are not in training but are present in testing?
What we want to do is remove the "excluded colors" from the training set, have them in the testing set, and keep a training/test split of 70/30.
What we need is a bit of math.
Given the total dataset (TD) and the excluded colors dataset (E) we can say that for train dataset (Tr) and test dataset (Ts) that:
|Tr| = x * (|TD|-|E|)
|Ts| = |E| + (1-x) * (|TD| - |E|)
We also know that |Tr| = 0.7 |TD|
Hence x = 0.7 |TD| / (|TD| - |E|)
Now that we know the sampling factor x, we can say:
Tr = (TD-E).sample(withReplacement = false, fraction = x)
// where (TD - E) is the result of the SQL expr above
Ts = TD.sample(withReplacement = false, fraction = 0.3)
// we sample the test set from the original dataset