K-means clustering error: only 0's may be mixed with negative subscripts

I am trying to do k-means clustering on the iris data in R. I want to use the KKZ option for seed selection (the starting points of the clusters).
If I don't standardize the data, I have no issues with the kkz() command:
library(inaparc)
res<- kkz(x=iris[,1:4], k=3)
seed <- res$v # this gives me the cluster seeds based on KKZ method
k1 <- kmeans(iris[,1:4], seed, iter.max=1000)
However, when I scale the data first, the kkz() command gives me this error:
library(ClusterR)
dat <- center_scale(iris[1:4], mean_center = TRUE, sd_scale = TRUE) # scale iris data
res2 <- kkz(x=dat, k=3)
Error in x[-x[i, ], ] : only 0's may be mixed with negative subscripts
I think this is an array-indexing issue, but I'm not sure what it is or how to solve it.

For some reason, kkz() cannot take any input that mixes positive and negative values. I had a lot of problems running it, for example:
#ok
set.seed(1000)
kkz(matrix(rnorm(1000,5,1),100,10),3)
# not ok
kkz(matrix(rnorm(1000,0,1),100,10),3)
Error in x[-x[i, ], ] : only 0's may be mixed with negative subscripts
You don't really need to center your values, so you can do:
dat <- center_scale(iris[1:4], mean_center = FALSE, sd_scale = TRUE)
res2 <- kkz(x=dat, k=3)
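From there, the KKZ seeds can be passed to kmeans() just as in the question. A minimal sketch, assuming res2$v holds the seed coordinates in the same layout as res$v above:
seed2 <- res2$v # KKZ seeds computed on the sd-scaled data
k2 <- kmeans(dat, centers = seed2, iter.max = 1000) # k-means on the same scaled data
k2$cluster # cluster assignments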
I would be quite cautious about using this package until you figure out why this is the case.


PySpark approxSimilarityJoin() not returning any results

I am trying to find similar users by vectorizing user features and sorting by the distance between user vectors in PySpark. I'm running this in Databricks on a Runtime 5.5 LTS ML cluster (Scala 2.11, Spark 2.4.3).
Following the code in the docs, I am using the approxSimilarityJoin() method from the pyspark.ml.feature.BucketedRandomProjectionLSH model.
I have found similar users successfully using approxSimilarityJoin(), but every now and then I come across a user of interest that apparently has no users similar to them.
Usually when approxSimilarityJoin() doesn't return anything, I assume it's because the threshold parameter is set too low. That fixes the issue sometimes, but now I've tried a threshold of 100000 and I'm still getting nothing back.
I define the model as
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=1.0)
I'm not sure whether changing bucketLength or numHashTables would help in obtaining results.
The following example shows a pair of users where approxSimilarityJoin() returned something (dataA, dataB) and a pair of users (dataC, dataD) where it didn't.
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col
dataA = [(0, Vectors.dense([0.7016968702094931,0.2636417660310031,4.155293362824633,4.191398632883099]),)]
dataB = [(1, Vectors.dense([0.3757117100334294,0.2636417660310031,4.1539923630906745,4.190086328785612]),)]
dfA = spark.createDataFrame(dataA, ["customer_id", "scaledFeatures"])
dfB = spark.createDataFrame(dataB, ["customer_id", "scaledFeatures"])
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=2.0,
numHashTables=3)
model = brp.fit(dfA)
# returns
# threshold of 100000 is clearly overkill
# A dataframe with the dfA and dfB feature vectors and an EuclideanDistance of 0.32599039770730354
model.approxSimilarityJoin(dfA, dfB, 100000, distCol="EuclideanDistance").show()
dataC = [(0, Vectors.dense([1.1600056435954367,78.27652460873155,3.5535837780801396,0.0030949620591871887]),)]
dataD = [(1, Vectors.dense([0.4660731192450482,39.85571715054726,1.0679201943112886,0.012330725745062067]),)]
dfC = spark.createDataFrame(dataC, ["customer_id", "scaledFeatures"])
dfD = spark.createDataFrame(dataD, ["customer_id", "scaledFeatures"])
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=2.0,
numHashTables=3)
model = brp.fit(dfC)
# returns empty df
model.approxSimilarityJoin(dfC, dfD, 100000, distCol="EuclideanDistance").show()
I was able to obtain results for the second half of the example above by increasing the bucketLength parameter to 15. The threshold could then have been lowered, because the Euclidean distance between the vectors turned out to be ~34.
Per the PySpark docs:
bucketLength = the length of each hash bucket, a larger bucket lowers the false negative rate
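For reference, here is a minimal sketch of that fix applied to the dataC/dataD pair from the question (it assumes an existing SparkSession named spark, as above; only bucketLength and the threshold are changed):
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors
dataC = [(0, Vectors.dense([1.1600056435954367, 78.27652460873155, 3.5535837780801396, 0.0030949620591871887]),)]
dataD = [(1, Vectors.dense([0.4660731192450482, 39.85571715054726, 1.0679201943112886, 0.012330725745062067]),)]
dfC = spark.createDataFrame(dataC, ["customer_id", "scaledFeatures"])
dfD = spark.createDataFrame(dataD, ["customer_id", "scaledFeatures"])
# wider buckets: bucketLength raised from 2.0 to 15.0
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes",
                                  bucketLength=15.0, numHashTables=3)
model = brp.fit(dfC)
# the pair is now returned, so the threshold can sit just above the actual distance (~34)
model.approxSimilarityJoin(dfC, dfD, 50, distCol="EuclideanDistance").show()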

Microsoft SEAL : Required negative values as a result after subtraction of two PolyCRT composed ciphertexts

Suppose I have two vectors x = [1,2,3,4] and y = [5,1,2,6].
I composed and encrypted the two arrays using PolyCRTBuilder (Ciphertextx and Ciphertexty).
If I subtract the two ciphertexts (Ciphertextx MINUS Ciphertexty), I should get Result = [-4, 1, 1, -2], but after the homomorphic subtraction I am getting ResultDecrypted = [40957, 1, 1, 40959].
I understand that we get this result because the plaintext is only defined modulo plain_modulus. But I want to use the resulting negative values in the next computation: how can I assign the resulting negative values to a vector and use them for further computations?
You are using a pretty old version of SEAL if it still has PolyCRTBuilder; in newer versions of the library it has been renamed to BatchEncoder, and it supports encoding to/from std::vector<std::int64_t>, which, I believe, is what you want.
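For example, here is a minimal sketch of the subtraction with BatchEncoder, written against the SEAL 3.4/3.5-style API (class and factory names shift a little between releases, so treat it as an outline rather than exact code for your version):
#include "seal/seal.h"
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    using namespace seal;

    // BFV parameters with batching enabled (parameter values here are illustrative assumptions)
    EncryptionParameters parms(scheme_type::BFV);
    std::size_t poly_modulus_degree = 8192;
    parms.set_poly_modulus_degree(poly_modulus_degree);
    parms.set_coeff_modulus(CoeffModulus::BFVDefault(poly_modulus_degree));
    parms.set_plain_modulus(PlainModulus::Batching(poly_modulus_degree, 20));

    auto context = SEALContext::Create(parms);
    KeyGenerator keygen(context);
    Encryptor encryptor(context, keygen.public_key());
    Evaluator evaluator(context);
    Decryptor decryptor(context, keygen.secret_key());
    BatchEncoder encoder(context);

    std::vector<std::int64_t> x{1, 2, 3, 4};
    std::vector<std::int64_t> y{5, 1, 2, 6};

    Plaintext px, py;
    encoder.encode(x, px);              // signed encode
    encoder.encode(y, py);

    Ciphertext cx, cy;
    encryptor.encrypt(px, cx);
    encryptor.encrypt(py, cy);

    evaluator.sub_inplace(cx, cy);      // cx now encrypts x - y

    Plaintext presult;
    decryptor.decrypt(cx, presult);

    std::vector<std::int64_t> result;
    encoder.decode(presult, result);    // signed decode: first slots are -4, 1, 1, -2

    for (std::size_t i = 0; i < 4; i++)
        std::cout << result[i] << " ";  // the negative values can be reused directly
    std::cout << std::endl;
}
The signed decode maps any slot value above plain_modulus/2 back to a negative integer, which is the same 40957 -> -4 conversion you would otherwise do by hand, and the resulting std::vector<std::int64_t> can be fed into the next computation.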

How to overcome indefinite matrix error (NbClust)?

I'm getting the following error when calling NbClust():
Error in NbClust(data = ds[, sapply(ds, is.numeric)], diss = NULL, distance = "euclidean", : The TSS matrix is indefinite. There must be too many missing values. The index cannot be calculated.
I've called ds <- ds[complete.cases(ds),] just before running NbClust, so there are no missing values.
Any idea what's behind this error?
Thanks
I had the same issue in my research.
So I mailed Nadia Ghazzali, the package maintainer, and got an answer.
I'll attach my mail and her reply.
my e-mail:
Dear Nadia Ghazzali. Hello Nadia. I have some questions about the NbClust function in the R library NbClust. I have tried googling but could not find satisfying answers. First, I'm so grateful to you for making this awesome R library. It is very helpful for my research. I tested the NbClust function in the NbClust library with my own data like below.
> clust <- NbClust(data, distance = "euclidean", min.nc = 2, max.nc = 10, method = "kmeans", index = "all")
But soon an error occurred:
Error: division by zero!
Error in Indices.WBT(x = jeu, cl = cl1, P = TT, s = ss, vv = vv) : object 'scott' not found
So I went through the NbClust function line by line and found that some indices, like CCC, Scott, marriot, tracecovw, tracew, friedman, and rubin, were not calculated because the object vv = 0. I'm not very familiar with algebra, so I don't know the meaning of an eigenvalue, but it seems to me that the object ss (which is the square root of the eigenvalues) should not be 0 after it is computed.
So, here are my questions.
I assume that my data is so sparse (a lot of zero values) that sqrt(eigenValues) becomes too small, is that right? I'm sorry I can't attach my data, but I can attach some of the eigenvalues and their square roots.
> head(eigenValues)
[1] 0.039769880 0.017179826 0.007011972 0.005698736 0.005164871 0.004567238
> head(sqrt(eigenValues))
[1] 0.19942387 0.13107184 0.08373752 0.07548997 0.07186704 0.06758134
And if my assumption is right, what can I do about this problem? Is the only way to drop those 7 indices?
Thank you for reading, and I'll be waiting for your reply. Best regards!
and her reply:
Dear Hansol,
Thank you for your interest. Yes, your understanding is good.
Unfortunately, the seven indices could not be applied.
Best regards,
Nadia Ghazzali
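If dropping those seven indices is acceptable, one workaround (a sketch, assuming the remaining indices run fine on your data) is to request them one at a time through the index argument instead of index = "all", for example:
clust_sil <- NbClust(data, distance = "euclidean", min.nc = 2, max.nc = 10,
                     method = "kmeans", index = "silhouette")
clust_sil$Best.nc # best number of clusters according to the silhouette index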
#seni The cause of this error is data-related. If you look at the source code of this function:
NbClust <- function(data, diss = "NULL", distance = "euclidean", min.nc = 2, max.nc = 15,
                    method = "ward", index = "all", alphaBeale = 0.1)
{
    x <- 0
    min_nc <- min.nc
    max_nc <- max.nc
    jeu1 <- as.matrix(data)
    numberObsBefore <- dim(jeu1)[1]
    jeu <- na.omit(jeu1) # returns the object with incomplete cases removed
    nn <- numberObsAfter <- dim(jeu)[1]
    pp <- dim(jeu)[2]
    TT <- t(jeu) %*% jeu
    sizeEigenTT <- length(eigen(TT)$value)
    eigenValues <- eigen(TT/(nn-1))$value
    for (i in 1:sizeEigenTT)
    {
        if (eigenValues[i] < 0) {
            print(paste("There are only", numberObsAfter, "nonmissing observations out of a possible", numberObsBefore, "observations."))
            stop("The TSS matrix is indefinite. There must be too many missing values. The index cannot be calculated.")
        }
    }
And I think the root cause of this error is the negative eigenvalues that seep in when the number of clusters is very high, i.e. when max.nc is high. So to solve the problem, you must look at your data. See if it has more columns than rows. Remove missing values, and check for issues like collinearity & multicollinearity, variance, covariance, etc.
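To see whether that is what is happening with your data, you can reproduce the check from the excerpt above before calling NbClust(). A minimal sketch, using the same ds as in the question:
ds_num <- ds[, sapply(ds, is.numeric)]   # numeric columns, as in the original call
jeu <- na.omit(as.matrix(ds_num))        # drop incomplete cases, like NbClust does
nn <- nrow(jeu)
TT <- t(jeu) %*% jeu
eigenValues <- eigen(TT / (nn - 1))$values
any(eigenValues < 0)                     # TRUE means you will hit "The TSS matrix is indefinite"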
For the other error, "invalid clustering method", look at the source code of the method here. Look at lines 168 and 169 in the given link. You are getting this error message because the clustering method is empty:
if (is.na(method))
    stop("invalid clustering method")

P-adjustment (FDR) on Hierarchical Clustering on Principal Components (HCPC) in R

I'm working right now with Hierarchical Clustering on Principal Components (HCPC). At the end of the analysis, p-values are computed by the HCPC function.
I searched, but I couldn't find any function that adjusts the p-values based on FDR together with HCPC. It's really important to avoid any junk data in my multivariate set. Therefore my question is: how can I run the p-value adjustment together with HCPC?
This is what I'm doing right now:
#install.packages(c("FactoMineR", "factoextra", "missMDA"))
library(ggplot2)
library(factoextra)
library(FactoMineR)
library(missMDA)
library(data.table)
MyData <- fread('https://drive.google.com/open?id=1y1YbIXtUssEBqmMSEbiQGcoV5j2Bz31k')
row.names(MyData) <- MyData$ID
MyData [1] <- NULL
Mydata_frame <- data.frame(MyData)
# Compute PCA with ncp = 3 (Variate based on the cluster number)
Mydata_frame.pca <- PCA(Mydata_frame, ncp = 2, graph = FALSE)
# Compute hierarchical clustering on principal components
Mydata.hcpc <- HCPC(Mydata_frame.pca, graph = FALSE)
Mydata.hcpc$desc.var$quanti
                              v.test Mean in category  Overall mean sd in category Overall sd      p.value
CD8RAnegDRpos              12.965378     -0.059993483 -0.3760962775     0.46726224 0.53192037 1.922798e-38
TregRAnegDRpos             12.892725      0.489753272  0.1381306362     0.46877083 0.59502553 4.946490e-38
mTregCCR6pos197neg195neg   12.829277      1.107851623  0.6495813704     0.48972987 0.77933283 1.124088e-37
CD8posCCR6neg183neg194neg  12.667318      1.741757598  1.1735140264     0.45260338 0.97870842 8.972977e-37
mTregCCR6neg197neg195neg   12.109074      1.044905184  0.6408258230     0.51417779 0.72804665 9.455537e-34
CD8CD8posCD4neg            11.306215      0.724115486  0.4320918842     0.49823677 0.56351333 1.222504e-29
CD8posCCR6pos183pos194neg  11.226390     -0.239967805 -0.4982954123     0.49454619 0.50203520 3.025904e-29
TconvRAnegDRpos            11.011114     -0.296585038 -0.5279707475     0.44863446 0.45846770 3.378002e-28
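HCPC itself does not offer an FDR option, so one straightforward approach is to adjust the p-values afterwards with stats::p.adjust(). A minimal sketch; it assumes desc.var$quanti is a list with one matrix per cluster (as FactoMineR's catdes() returns); if yours prints as a single matrix like the output above, apply p.adjust() to its p.value column directly:
quanti <- Mydata.hcpc$desc.var$quanti
adjusted <- lapply(quanti, function(m) {
  m <- cbind(m, p.adj.fdr = p.adjust(m[, "p.value"], method = "fdr"))  # Benjamini-Hochberg
  m[m[, "p.adj.fdr"] < 0.05, , drop = FALSE]  # keep only variables still significant after FDR
})
adjusted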

Spark (Scala): three separate RDD[org.apache.spark.mllib.linalg.Vector] to a single RDD[Vector]

I have three separate RDD[mllib ... Vector]s and I need to combine them into a single RDD[mllib Vector].
val vvv = my_ds.map(x=>(scaler.transform(Vectors.dense(x(0))),Vectors.dense((x(1)/bv_max_2).toArray),Vectors.dense((x(2)/bv_max_1).toArray)))
More info:
scaler => StandardScaler
bv_max_... is just a DenseVector from the breeze library, used for the x/max(x) normalization.
Now I need to make them all one vector.
I get ([1.],[2.],[3.]) and [[1.],[2.],[3.]],
but I need [1.,2.,3.] as one vector.
Finally I found a way... I don't know if it is the best one.
I had a 3-dimensional data set and I needed to perform x/max(x) normalization on two dimensions and apply the StandardScaler to the other dimension.
My problem was that in the end I had 3 separate Vectors, e.g.:
[ [1.0], [4.0], [5.0] ]
[ [2.0], [5.0], [6.0] ]
...but I needed [1.0, 4.0, 5.0], which can be passed to KMeans.
I changed the above code to:
val vvv = dsx.map(x =>
    scaler.transform(Vectors.dense(x.days_d)).toArray ++
    (x.freq_d / bv_max_freq).toArray ++
    (x.food_d / bv_max_food).toArray
  ).map(x => Vectors.dense(x(0), x(1), x(2)))
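As a self-contained illustration of the same pattern (with hypothetical field names, the StandardScaler step omitted, and plain Doubles standing in for the breeze vectors), the concatenated arrays can be turned into one Vector per row and fed straight to KMeans:
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

case class Row3(days: Double, freq: Double, food: Double)

// scale each dimension separately, then merge into a single dense vector per row
def toFeatureVectors(rows: RDD[Row3], freqMax: Double, foodMax: Double): RDD[Vector] =
  rows.map { r =>
    val merged = Array(r.days) ++ Array(r.freq / freqMax) ++ Array(r.food / foodMax)
    Vectors.dense(merged)
  }

// val features = toFeatureVectors(dsx, freqMax, foodMax)
// val model = KMeans.train(features, 3, 20)  // k = 3, maxIterations = 20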