P-value adjustment (FDR) on Hierarchical Clustering on Principal Components (HCPC) in R

I'm currently working with Hierarchical Clustering on Principal Components (HCPC). At the end of the analysis, the HCPC function computes p-values.
I searched, but I couldn't find any function that adjusts these p-values for the false discovery rate (FDR) together with HCPC. This matters because I want to avoid spurious results in my multivariate data set. So my question is: how can I apply a p-value adjustment to the HCPC results?
This is what I'm doing right now:
#install.packages(c("FactoMineR", "factoextra", "missMDA"))
library(ggplot2)
library(factoextra)
library(FactoMineR)
library(missMDA)
library(data.table)
MyData <- fread('https://drive.google.com/open?id=1y1YbIXtUssEBqmMSEbiQGcoV5j2Bz31k')
row.names(MyData) <- MyData$ID
MyData [1] <- NULL
Mydata_frame <- data.frame(MyData)
# Compute PCA, keeping ncp = 2 components (this varies with the number of clusters)
Mydata_frame.pca <- PCA(Mydata_frame, ncp = 2, graph = FALSE)
# Compute hierarchical clustering on principal components
Mydata.hcpc <- HCPC(Mydata_frame.pca, graph = FALSE)
Mydata.hcpc$desc.var$quanti
                             v.test Mean in category  Overall mean sd in category Overall sd      p.value
CD8RAnegDRpos             12.965378     -0.059993483 -0.3760962775     0.46726224 0.53192037 1.922798e-38
TregRAnegDRpos            12.892725      0.489753272  0.1381306362     0.46877083 0.59502553 4.946490e-38
mTregCCR6pos197neg195neg  12.829277      1.107851623  0.6495813704     0.48972987 0.77933283 1.124088e-37
CD8posCCR6neg183neg194neg 12.667318      1.741757598  1.1735140264     0.45260338 0.97870842 8.972977e-37
mTregCCR6neg197neg195neg  12.109074      1.044905184  0.6408258230     0.51417779 0.72804665 9.455537e-34
CD8CD8posCD4neg           11.306215      0.724115486  0.4320918842     0.49823677 0.56351333 1.222504e-29
CD8posCCR6pos183pos194neg 11.226390     -0.239967805 -0.4982954123     0.49454619 0.50203520 3.025904e-29
TconvRAnegDRpos           11.011114     -0.296585038 -0.5279707475     0.44863446 0.45846770 3.378002e-28
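
One possible approach, offered as a sketch rather than a built-in HCPC feature: the tables in desc.var$quanti carry a p.value column, so base R's p.adjust() can be applied to each table after the fact. The helper add_fdr and the toy table below are hypothetical names and values made up for illustration. Note the assumption in the comments about the proba argument: by default only variables below the significance threshold are reported, and an FDR correction is only meaningful when it sees all the p-values.

```r
# Add a Benjamini-Hochberg (FDR) adjusted column to each cluster's table.
# 'quanti' is expected to look like Mydata.hcpc$desc.var$quanti:
# a list with one matrix per cluster, each with a "p.value" column.
add_fdr <- function(quanti, method = "BH") {
  lapply(quanti, function(tab) {
    if (is.null(tab)) return(NULL)  # cluster with no reported variables
    cbind(tab, p.adjusted = p.adjust(tab[, "p.value"], method = method))
  })
}

# Toy stand-in for one cluster's table (hypothetical values).
toy <- list(
  `1` = cbind(v.test = c(3.1, 2.2, 2.0), p.value = c(0.01, 0.04, 0.03))
)
adjusted <- add_fdr(toy)
print(adjusted[["1"]])

# With the real object one might call (assumption: 'proba' is the threshold
# HCPC forwards to its cluster description, so proba = 1 keeps every variable):
# Mydata.hcpc <- HCPC(Mydata_frame.pca, graph = FALSE, proba = 1)
# add_fdr(Mydata.hcpc$desc.var$quanti)
```

The adjustment here is done separately within each cluster. Whether to pool the p-values across clusters before adjusting is a design choice that depends on what family of tests you want to control.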

Related

Cross variogram negative : flip upside down

Using cokriging, I first computed a cross variogram. However, my two variables are negatively related, so the cross variogram comes out upside down. I was wondering whether it is possible to flip it so that it looks like the other two.
CEC: cation exchange capacity
IB.samples: soil sealing index
chart.Correlation(Newdata[,.(A,pH_KCL,CEC,IB.samples)])
Chart correlation
g <- gstat(id = "IB.samples", formula = IB.samples ~ 1,
           data = na.omit(Newdata), locations = ~x + y)
g <- gstat(g, id = "CEC.res", formula = CEC.res ~ 1,
           data = na.omit(Newdata), locations = ~x + y)
v.cross <- variogram(g, cutoff = 170, width = 15)
plot(v.cross, pch = 16, col = 'black')
lmc.model <- vgm(psill = 0, model = "Sph", range = 100, nugget = 0)
LMC <- fit.lmc(v.cross, g, fit.method = 7,
               model = lmc.model, correct.diagonal = 0.95)
plot(v.cross, LMC, pch = 10, col = 'black')
Cross variogram
A possible solution, although it sounds a little crude:
Perhaps I could change the sign of one of my variables so that the correlation between them becomes positive, but I did not succeed in doing so.
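
To illustrate the sign-change idea with a toy example (base R only, not the gstat workflow above): the empirical cross-semivariance averages products of paired increments, so negating one variable flips its sign at every lag. The function cross_semivariance and the simulated transect are hypothetical and made up for this sketch.

```r
# Empirical cross-semivariance on a regular 1-D transect:
# gamma12(h) = 0.5 * mean((z1[i] - z1[i + h]) * (z2[i] - z2[i + h]))
cross_semivariance <- function(z1, z2, lag) {
  n <- length(z1)
  head_idx <- seq_len(n - lag)
  tail_idx <- head_idx + lag
  0.5 * mean((z1[head_idx] - z1[tail_idx]) * (z2[head_idx] - z2[tail_idx]))
}

set.seed(42)
z1 <- cumsum(rnorm(200))          # spatially autocorrelated toy variable
z2 <- -z1 + rnorm(200, sd = 0.5)  # negatively related to z1

lags <- 1:10
gamma_original <- sapply(lags, function(h) cross_semivariance(z1, z2, h))
gamma_flipped  <- sapply(lags, function(h) cross_semivariance(z1, -z2, h))

# The flipped curve is the mirror image of the original one.
print(cbind(lag = lags, original = gamma_original, flipped = gamma_flipped))
```

In the gstat setting this would presumably mean adding a negated copy of one column to the data before building the gstat object, and remembering that anything predicted for that variable is then on the negated scale.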

K-means clustering error: only 0's may be mixed with negative subscripts

I am trying to do k-means clustering on the iris data in R. I want to use the KKZ option for seed selection (the starting points of the clusters).
If I don't standardize the data, I have no issues with the kkz command:
library(inaparc)
res<- kkz(x=iris[,1:4], k=3)
seed <- res$v # this gives me the cluster seeds based on KKZ method
k1 <- kmeans(iris[,1:4], seed, iter.max=1000)
However, when I scale the data first, the kkz command gives me an error:
library(ClusterR)
dat <- center_scale(iris[1:4], mean_center = TRUE, sd_scale = TRUE) # scale iris data
res2 <- kkz(x=dat, k=3)
Error in x[-x[i, ], ] : only 0's may be mixed with negative subscripts
I think this is an array indexing issue, but I'm not sure what causes it or how to solve it.
For some reason, kkz cannot handle input that mixes positive and negative values. I ran into problems running it as well, for example:
#ok
set.seed(1000)
kkz(matrix(rnorm(1000,5,1),100,10),3)
# not ok
kkz(matrix(rnorm(1000,0,1),100,10),3)
Error in x[-x[i, ], ] : only 0's may be mixed with negative subscripts
You don't really need to center your values, so you can do:
dat <- center_scale(iris[1:4], mean_center = FALSE, sd_scale = TRUE)
res2 <- kkz(x=dat, k=3)
I would be quite cautious about using this package until you figure out why it behaves this way.
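
As a small base R illustration of the indexing rule behind that message (a sketch, not a look at the kkz source): judging from the call shown in the error, x[-x[i, ], ], data values end up being used as negative subscripts. R accepts a subscript vector that is all negative (drop those rows) but rejects one that mixes negative and positive entries, which is what happens once a row contains values of both signs. The matrix and values below are made up.

```r
x <- matrix(1:12, nrow = 4)

# All-negative subscripts are fine: rows 1 and 2 are dropped.
dropped <- x[-c(1, 2), ]

# A row of centred data can hold both signs (hypothetical values).
row_values <- c(1, -2)
neg_index <- -row_values  # c(-1, 2): one negative and one positive subscript

# R refuses the mixed subscript; capture the message instead of stopping.
mixed <- tryCatch(x[neg_index, ], error = function(e) conditionMessage(e))
print(mixed)
```

This is also consistent with the workaround above: scaling by the standard deviation without centring leaves the iris measurements positive, so the negated values all have the same sign.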

Prediction in R doesn't work following rpart

I am working on a regression tree. My code runs without errors, but I don't get predicted values for the test set. Instead I get values for all of my y variable (the response variable). Here's the code:
Separating the data into train and test sets
sample = sample.split(Data, SplitRatio = .80)
train = subset(Data, sample == TRUE)
test = subset(Data, sample == FALSE)
varYTrain <- train[c(3)]
varYTest <- test[c(3)]
varXTrain <- train[c(5:27)]
varXTest <- test[c(5:27)]
Model
x <- cbind(varXTrain,varYTrain)
fit <- rpart(as.matrix(varYTrain) ~ ., data = x, method="class")
summary(fit)
This part doesn't work: for some unknown reason, I don't get predictions based on the test data set
predicted <- predict(fit, data=varXTest)
summary(predicted)
In the end I would also like the output to compare the predicted values with the real values in my dataset. Can I do that?
Thank you very much, and don't hesitate to ask me a question if I am not clear enough. It's my first time posting.
Cheers
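
A likely culprit, offered as a sketch rather than a confirmed diagnosis: predict() for rpart objects takes the new observations through the newdata argument. An argument named data is not recognised and is silently ignored, so the call returns fitted values for the training rows. The example below uses the built-in mtcars data and a regression tree (method = "anova") as stand-ins for the original Data; the split and the variable names are illustrative only.

```r
library(rpart)

set.seed(1)
# 80/20 split of a built-in data set (stand-in for the original Data).
idx <- sample(nrow(mtcars), size = round(0.8 * nrow(mtcars)))
train <- mtcars[idx, ]
test <- mtcars[-idx, ]

# Name the response in the formula so it is not also used as a predictor.
fit <- rpart(mpg ~ ., data = train, method = "anova")

# 'data' is not an argument of predict(): it is ignored and the
# fitted values for the training set come back.
wrong <- predict(fit, data = test)

# 'newdata' gives one prediction per test row.
right <- predict(fit, newdata = test)

# Compare predicted and observed values side by side.
comparison <- data.frame(actual = test$mpg, predicted = right)
rmse <- sqrt(mean((comparison$actual - comparison$predicted)^2))
print(comparison)
print(rmse)
```

For a classification tree (method = "class"), predict(fit, newdata = test, type = "class") combined with table() on predicted and actual labels would play the same role as the comparison above.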

A function for bigglm model selection, like dredge for glm

I was using glm with dredge from the MuMIn package. Now that my data is large, I am using bigglm from the biglm package. How do I do model selection, given that dredge does not work with bigglm? Is there another package I can use to achieve this?
When I apply dredge to a bigglm model, I get the following error:
Error in nobs.default(global.model) : no 'nobs' method is available
dredge relies on the availability of a logLik method for the given model class. A big[g]lm object does not provide one, and there seems to be a long-known bug in the AIC method for the big[g]lm class that makes it impossible to calculate the log-likelihood (LL) from it: it uses the deviance rather than the LL to calculate AIC, so the AIC values are not comparable to those of other model types (see here: AIC different between biglm and lm).
You could try adding the missing methods (using the deviance instead of the LL, which may be unreliable):
# incorrect if any prior weights are 0
nobs.biglm <- function (object, ...) object$n
logLik.bigglm <- function(object, ...) {
    dev <- deviance(object, ...)
    df <- object$n - object$df.resid
    structure(dev, df = df, nobs = object$n)
}
coefTable.biglm <- function (model, data, ...) {
    ct <- summary(model)$mat[, c(1L, 4L, 5L), drop = FALSE]
    .makeCoefTable(ct[, 1L], se = ct[, 2L], df = model$df.resid, coefNames = rownames(ct))
}
environment(coefTable.biglm) <- asNamespace("MuMIn")
# from example(bigglm)
fm <- bigglm(log(Volume) ~ log(Girth) + log(Height), data = trees, chunksize = 10, sandwich = TRUE)
dredge(fm, rank = AIC)

Produce State Frequencies and Entropy Index for a Particular Variable

I am able to generate separate plots from my data set (DISDATAE) for different variables (GENDER, RACE), such as
seqIplot(DISDATAE.seq, border = NA, group = DISDATAE$GENDER, sortv = "from.start")
seqIplot(DISDATAE.seq, border = NA, group = DISDATAE$RACE, sortv = "from.start")
How do I generate separate state frequency and entropy tables for each variable?
I used this syntax for the entire data set: seqstatd(DISDATAE.seq[, 1:4]), but I am unable to create one table per group for each variable.
Just use by(). I illustrate using the mvad data that ships with TraMineR:
library(TraMineR)
data(mvad)
# creating the state sequence object
mvad.seq <- seqdef(mvad[, 15:86])
## Distributions and cross-sectional entropies by sex
by(mvad.seq, mvad$male, seqstatd)
Hope this helps.