SPS format of representative sequences from seqrep - TraMineR

Does anyone know how to extract the representative sequences from the seqrep output in SPS format? It would be very helpful for viewers of the plot produced by seqrplot.

You can print the output with the format = 'SPS' argument. I illustrate with the biofam data:
data(biofam)
biofam.lab <- c("Parent", "Left", "Married", "Left+Marr",
"Child", "Left+Child", "Left+Marr+Child", "Divorced")
biofam.seq <- seqdef(biofam, 10:25, labels=biofam.lab)
## Computing the distance matrix
costs <- seqcost(biofam.seq, method="INDELSLOG")
biofam.om <- seqdist(biofam.seq, method="OM", sm=costs$sm, indel=costs$indel)
## Representative set using the neighborhood density criterion
biofam.rep <- seqrep(biofam.seq, diss=biofam.om, criterion="density")
print(biofam.rep, format="SPS")
## Sequence
## [1] (0,16)
## [2] (0,9)-(3,1)-(6,6)
## [3] (0,6)-(1,10)
And if you want to retrieve the representative sequences in SPS form, you can use seqformat and seqconc:
biofam.rep.sps <- seqconc(seqformat(biofam.rep, to='SPS'))
The result is a single-column matrix with each representative sequence stored as a character string.
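If you want a plain character vector rather than a one-column matrix, you can drop the matrix structure; a minimal follow-up sketch using the objects created above:
## Coerce the one-column matrix into a character vector of SPS strings
biofam.rep.sps.vec <- as.character(biofam.rep.sps)
biofam.rep.sps.vec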

Related

K-means clustering error: only 0's may be mixed with negative subscripts

I am trying to do k-means clustering on the iris data in R. I want to use the KKZ option for the seed selection (starting points of clusters).
If I don't standardize the data, I have no issues with the kkz command:
library(inaparc)
res <- kkz(x=iris[,1:4], k=3)
seed <- res$v # this gives me the cluster seeds based on KKZ method
k1 <- kmeans(iris[,1:4], seed, iter.max=1000)
However, when I scale the data first, the kkz command gives me this error:
library(ClusterR)
dat <- center_scale(iris[1:4], mean_center = TRUE, sd_scale = TRUE) # scale iris data
res2 <- kkz(x=dat, k=3)
Error in x[-x[i, ], ] : only 0's may be mixed with negative subscripts
I think this is an array-indexing issue, but I am not sure what causes it or how to solve it.
For some reason, kkz cannot take anything with a mixture of positive and negative values. I had a lot of problems running it, for example:
#ok
set.seed(1000)
kkz(matrix(rnorm(1000,5,1),100,10),3)
# not ok
kkz(matrix(rnorm(1000,0,1),100,10),3)
Error in x[-x[i, ], ] : only 0's may be mixed with negative subscripts
You don't really need to center your values, so you can do:
dat <- center_scale(iris[1:4], mean_center = FALSE, sd_scale = TRUE)
res2 <- kkz(x=dat, k=3)
I would be quite cautious about using this package until you figure out why this happens.
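If you do need centered data for the k-means step itself, one possible workaround, assuming the error is really triggered only by negative values (as the examples above suggest), is to run kkz on a positively shifted copy of the data and shift the returned seeds back. Shifting changes the norms KKZ uses to pick seeds, so treat this as a sketch rather than an exact equivalent:
# Hedged workaround: shift the centered data into the positive range,
# pick the seeds there, then map them back to the centered scale
shift <- abs(min(dat)) + 1
res2 <- kkz(x = dat + shift, k = 3)
seed2 <- res2$v - shift
k2 <- kmeans(dat, seed2, iter.max = 1000)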

P-adjustment (FDR) on Hierarchical Clustering on Principal Components (HCPC) in R

I'm currently working with Hierarchical Clustering on Principal Components (HCPC). At the end of the analysis, p-values are computed by the HCPC function.
I searched but couldn't find any function that adjusts these p-values for the false discovery rate (FDR) together with HCPC. It's really important to keep junk variables out of my multivariate set, so my question is: how can I combine HCPC with p-value adjustment?
This is what I'm doing right now:
#install.packages(c("FactoMineR", "factoextra", "missMDA"))
library(ggplot2)
library(factoextra)
library(FactoMineR)
library(missMDA)
library(data.table)
MyData <- fread('https://drive.google.com/open?id=1y1YbIXtUssEBqmMSEbiQGcoV5j2Bz31k')
row.names(MyData) <- MyData$ID
MyData [1] <- NULL
Mydata_frame <- data.frame(MyData)
# Compute PCA with ncp = 2 (vary ncp based on the number of clusters)
Mydata_frame.pca <- PCA(Mydata_frame, ncp = 2, graph = FALSE)
# Compute hierarchical clustering on principal components
Mydata.hcpc <- HCPC(Mydata_frame.pca, graph = FALSE)
Mydata.hcpc$desc.var$quanti
                              v.test Mean in category  Overall mean sd in category Overall sd      p.value
CD8RAnegDRpos              12.965378     -0.059993483 -0.3760962775     0.46726224 0.53192037 1.922798e-38
TregRAnegDRpos             12.892725      0.489753272  0.1381306362     0.46877083 0.59502553 4.946490e-38
mTregCCR6pos197neg195neg   12.829277      1.107851623  0.6495813704     0.48972987 0.77933283 1.124088e-37
CD8posCCR6neg183neg194neg  12.667318      1.741757598  1.1735140264     0.45260338 0.97870842 8.972977e-37
mTregCCR6neg197neg195neg   12.109074      1.044905184  0.6408258230     0.51417779 0.72804665 9.455537e-34
CD8CD8posCD4neg            11.306215      0.724115486  0.4320918842     0.49823677 0.56351333 1.222504e-29
CD8posCCR6pos183pos194neg  11.226390     -0.239967805 -0.4982954123     0.49454619 0.50203520 3.025904e-29
TconvRAnegDRpos            11.011114     -0.296585038 -0.5279707475     0.44863446 0.45846770 3.378002e-28
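HCPC itself has no FDR option, but since the raw p-values are returned in desc.var$quanti, one approach is to adjust them afterwards with base R's p.adjust. A minimal sketch, assuming desc.var$quanti is a list with one description matrix per cluster, as FactoMineR returns it:
## Append a Benjamini-Hochberg (FDR) adjusted column to each
## cluster's variable-description table
quanti <- Mydata.hcpc$desc.var$quanti
quanti.fdr <- lapply(quanti, function(m)
  cbind(m, p.fdr = p.adjust(m[, "p.value"], method = "fdr")))
quanti.fdr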

How to apply word2vec for k-means clustering?

I am new to word2vec. Using this method, I am trying to form clusters based on words extracted by word2vec from the abstracts of scientific publications. To this end, I first extracted sentences from the abstracts via Stanford NLP and wrote each sentence on its own line in a text file. The text file required by deeplearning4j's word2vec was then ready to process (http://deeplearning4j.org/word2vec).
Since the texts come from scientific fields, there are a lot of mathematical terms or brackets. See the sample sentences below:
The meta-analysis showed statistically significant effects of pharmacopuncture compared to conventional treatment = 3.55 , P = .31 , I-2 = 16 % ) .
90 asymptomatic hypertensive subjects associated with LVH , DM , or RI were randomized to receive D&G herbal capsules 1 gm/day , 2 gm/day , or identical placebo capsules in double-blind and parallel fashion for 12 months .
After preparing the text file, I have run word2vec as below:
SentenceIterator iter = new LineSentenceIterator(new File(".../filename.txt"));
iter.setPreProcessor(new SentencePreProcessor() {
@Override
public String preProcess(String sentence) {
//System.out.println(sentence.toLowerCase());
return sentence.toLowerCase();
}
});
// Split on white spaces in the line to get words
TokenizerFactory t = new DefaultTokenizerFactory();
t.setTokenPreProcessor(new CommonPreprocessor());
log.info("Building model....");
Word2Vec vec = new Word2Vec.Builder()
.minWordFrequency(5)
.iterations(1)
.layerSize(100)
.seed(42)
.windowSize(5)
.iterate(iter)
.tokenizerFactory(t)
.build();
log.info("Fitting Word2Vec model....");
vec.fit();
log.info("Writing word vectors to text file....");
// Write word vectors
WordVectorSerializer.writeWordVectors(vec, "abs_terms.txt");
This script creates a text file with one word per row, each followed by its vector values, as below:
pills -4.559159278869629E-4 0.028691953048110008 0.023867368698120117 ...
tricuspidata -0.00431067543104291 -0.012515762820839882 0.0074045853689312935 ...
As a subsequent step, this text file was used to form clusters via k-means in Spark. See the code below:
val rawData = sc.textFile("...abs_terms.txt")
// token 0 is the word itself; tokens 1..100 are the 100 vector components
val extractedFeatureVector = rawData.map(s => Vectors.dense(s.split(' ').slice(1, 101).map(_.toDouble))).cache()
val numberOfClusters = 10
val numberOfIterations = 100
//We use the KMeans object provided by MLlib to run the clustering
val modell = KMeans.train(extractedFeatureVector, numberOfClusters, numberOfIterations)
modell.clusterCenters.foreach(println)
//Get the cluster index for each word
val AltCompByCluster = rawData.map {
row=>
(modell.predict(Vectors.dense(row.split(' ').slice(1, 101)
.map(_.toDouble))), row.split(' ').head) // (cluster index, word)
}
AltCompByCluster.foreach(println)
As a result of the Scala code above, I retrieved 10 clusters based on the word vectors produced by word2vec. However, when I checked the clusters, no obvious common words appeared; that is, I could not get reasonable clusters as I expected. This bottleneck leaves me with a few questions:
1) In some word2vec tutorials I have seen, no data cleaning is done; prepositions etc. are left in the text. So what cleaning procedure should I apply when using word2vec?
2) How can I visualize the clustering results in an informative way?
3) Can I use word2vec word vectors as input to neural networks? If so, which neural network architecture (convolutional, recursive, recurrent) would be more suitable for my goal?
4) Is word2vec meaningful for my goal?
Thanks in advance.

Filtering by Column in Matlab for a list or variety of values

I've written code that parses data from a text file into the required format, filters the data, and averages the output (in this case, the value in the fourth column).
I am trying to filter the data in column one for a list of values at once, with no fixed pattern to the values, e.g. 1001, 1007, 1048, 1192, 1200, ...
Currently my code only filters by a single value (1001). Is there a way of incorporating a list of values into this filter?
C_f = C(C(:,1) == 1001 , :);
Any help would be much appreciated!
See if this is what you want:
val = [1000 1001];
ind = ismember(C(:,1),val);
C_f = C(ind,:)
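For reference, ismember(C(:,1), val) returns a logical vector that is true wherever the first-column entry matches any value in val, so C_f keeps exactly those rows.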

Format of training data in CSV for Encog 3.0, and how to use it

I wonder how I can make a CSV file for storing training data in Encog. Currently I have 200 features (f) as inputs and multiple outputs (o) (for example authors A, B, C, ...). So how should I organize the CSV file? Should it look like this?
f1, f2, f3 ... f200, o1
f1, f2, f3 ... f200, o2
f1, f2, f3 ... f200, o3
Some of my questions are:
Can o1, o2 and o3 be strings (the authors' names)?
Will the training CSV file and the testing CSV file have the same format?
Is it possible to feed the NN directly with the CSV file, or must it be converted to a multi-dimensional array as in this example? Since I have 200 features as inputs, that would be quite difficult.
double XOR_INPUT[][] = {
{0, 0},
{1, 0},
{0, 1},
{1, 1}
};
How can I normalize the data in the CSV file (to the -1..+1 range) using the Encog framework?
Thank you very much.
No. A neural network operates only on floating-point numbers, preferably 0 to 1 (output) or -1 to 1 (input). For strings, use one-of-n encoding.
So e.g. if your outputs are 'a', 'b', 'c', encode them as:
1 0 0 = 'a'
0 1 0 = 'b'
0 0 1 = 'c'
You can also add a null class if necessary, for no result found.
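For illustration, here is a hedged sketch of one-of-n encoding in R (the authors vector is invented for the example; Encog itself is Java, so treat this purely as a demonstration of the encoding):
# One-of-n (one-hot) encoding of class labels
authors <- c("a", "b", "c", "a")
onehot <- model.matrix(~ authors - 1)  # one 0/1 column per distinct label
onehot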
You can read the data from CSV, but Encog expects everything in a 2D double array (or more precisely, an 'array of arrays').
To simplify things, start with, say, 10 features.
Normalization is done per feature. For each feature, the formula for normalizing a data point a is:
2 * ((a - min) / range) - 1
where range = max - min for that feature.
This maps every input data point into the range -1 to 1.
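As a quick illustration of that arithmetic (shown in R for brevity; raw_features stands for a hypothetical numeric matrix holding your 200 input columns):
# Per-feature min-max scaling to the -1..1 range
normalize_pm1 <- function(a) {
  rng <- max(a) - min(a)
  2 * ((a - min(a)) / rng) - 1  # min maps to -1, max maps to +1
}
scaled <- apply(raw_features, 2, normalize_pm1)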
Maybe post a real example of the data; that might give a better impression of what you need to do.