Produce State Frequencies and Entropy Index for a Particular Variable - traminer

I am able to generate separate plots from my data set (DISDARAE) for different variables (GENDER, RACE) such as
seqIplot(DISDATAE.seq, border = NA, group = DISDATAE$GENDER, sortv = "from.start")
seqIplot(DISDATAE.seq, border = NA, group = DISDATAE$RACE, sortv = "from.start")
How do I generate separate state frequency and entropy tables for each variable?
I used this syntax for the entire data set: seqstatd(DISDATAE.seq[, 1:4]), but unable to create one for separate variables

Just use by. I illustrate using the mvad data shipping with TraMineR
library(TraMineR)
data(mvad)
# creating the state sequence object
mvad.seq <- seqdef(mvad[, 15:86])
## Distributions and cross-sectional entropies by sex
by(mvad.seq, mvad$male, seqstatd)
Hope this helps.

Related

Issues with ordiplot3d NMDS in 3dvegan package

I am looking for some help here with this 3d NMDS code. I have 3 issues.
The layout of the plot moves significantly each time I execute the code.
The sites and species are sometimes far off of the plot.
The species text is often overlapping. How can I fix this?
I am unsure how to change the plotting environment to ggplot, so that might be out of the question.
library(vegan)
library(vegan3d)
library(tidyverse)
data("dune")
SiteID <- 1:20
NMDS = metaMDS(dune,distance="bray", try=500, wascores = TRUE, k=3)
NMDS1 = NMDS$points[,1]
NMDS2 = NMDS$points[,2]
NMDS3 = NMDS$points[,3]
NMDS = data.frame(NMDS1 = NMDS1, NMDS2 = NMDS2, NMDS3 = NMDS3, SiteID=SiteID)
NMDS_input <- metaMDS(dune,distance="bray",try=500,k=3,wascores = T)
pl4 <- with(NMDS, ordiplot3d(NMDS_input, pch=16, angle=50, main="Fish ion level 3", cex.lab=1.7,cex.symbols=1.5, tick.marks=FALSE))
sp <- scores(NMDS_input, choices=1:3, display="species", scaling="symmetric")
si <- scores(NMDS_input, choices=1:3, display="sites", scaling="symmetric")
text(pl4$xyz.convert(sp), rownames(sp), cex=0.7, xpd=TRUE)
sii <- as.data.frame(cbind(NMDS$SiteID,si))
with(NMDS, orditorp(pl4, labels = sii$V1, air=1, cex = 1))
labels must be character variables in orditorp. We always assumed so, but this was not checked in vegan::orditorp. Latest vegan version in github will take care of this and will also work with numeric labels.
ordiplot3d returns projected coordinates (in 2D) and if you want to plot those, you can just use the pl4 object that you saved and you do not need to use pl4$xyz.convert. This object will also be accepted in orditorp.
If you want to plot points that were not used in the original mock-3D plot, you must use pl4$xyz.convert for their 2D projection. This function will return the projected coordinates in a form that is directly accepted by standard R functions text, points (and some others), but they will not be accepted by orditorp (and I won't change this). You must make these into two-column matrix-like object; data.frame() will work.
Your example code contains a lot of un-needed code. The following is an edit with only necessary lines and fixes that make this example work with current vegan release.
library(vegan)
library(vegan3d)
data(dune)
SiteID <- as.character(1:20) # must be character
NMDS_input <- metaMDS(dune,distance="bray",try=500,k=3,wascores = T)
pl4 <- ordiplot3d(NMDS_input, pch=16, angle=50, main="Fish ion level 3", cex.lab=1.7,cex.symbols=1.5, tick.marks=FALSE) # no with(NMDS,...)
sp <- scores(NMDS_input, choices=1:3, display="species") # no arg scaling in scores.metaMDS
text(pl4$xyz.convert(sp), rownames(sp), cex=0.7, xpd=TRUE)
orditorp(pl4, labels = SiteID, air=1, cex = 1) # character labels w/points in the same location

How To Use kmedoids from pyclustering with set number of clusters

I am trying to use k-medoids to cluster some trajectory data I am working with (multiple points along the trajectory of an aircraft). I want to cluster these into a set number of clusters (as I know how many types of paths there should be).
I have found that k-medoids is implemented inside the pyclustering package, and am trying to use that. I am technically able to get it to cluster, but I do not know how to control the number of clusters. I originally thought it was directly tied to the number of elements inside what I called initial_medoids, but experimentation shows that it is more complicated than this. My relevant code snippet is below.
Note that D holds a list of lists. Each list corresponds to a single trajectory.
def hausdorff( u, v):
d = max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])
return d
traj_count = len(traj_lst)
D = np.zeros((traj_count, traj_count))
for i in range(traj_count):
for j in range(i + 1, traj_count):
distance = hausdorff(traj_lst[i], traj_lst[j])
D[i, j] = distance
D[j, i] = distance
from pyclustering.cluster.kmedoids import kmedoids
initial_medoids = [104, 345, 123, 1]
kmedoids_instance = kmedoids(traj_lst, initial_medoids)
kmedoids_instance.process()
cluster_lst = kmedoids_instance.get_clusters()[0]
num_clusters = len(np.unique(cluster_lst))
print('There were %i clusters found' %num_clusters)
I have a total of 1900 trajectories, and the above-code finds 1424 clusters. I had expected that I could control the number of clusters through the length of initial_medoids, as I did not see any option to input the number of clusters into the program, but this seems unrelated. Could anyone guide me as to the mistake I am making? How do I choose the number of clusters?
In case of requirement to obtain clusters you need to call get_clusters():
cluster_lst = kmedoids_instance.get_clusters()
Not get_clusters()[0] (in this case it is a list of object indexes in the first cluster):
cluster_lst = kmedoids_instance.get_clusters()[0]
And that is correct, you can control amount of clusters by initial_medoids.
It is true you can control the number of cluster, which correspond to the length of initial_medoids.
The documentation is not clear about this. The get__clusters function "Returns list of medoids of allocated clusters represented by indexes from the input data". so, this function does not return the cluster labels. It returns the index of rows in your original (input) data.
Please check the shape of cluster_lst in your example, using .get_clusters() and not .get_clusters()[0] as annoviko suggested. In your case, this shape should be (4,). So, you have a list of four elements (clusters), each containing the index or rows in your original data.
To get, for example, data from the first cluster, use:
kmedoids_instance = kmedoids(traj_lst, initial_medoids)
kmedoids_instance.process()
cluster_lst = kmedoids_instance.get_clusters()
traj_lst_first_cluster = traj_lst[cluster_lst[0]]

P-adjustment (FDR) on Hierarchical Clustering On Principle Components (HCPC) in R

I'm working right now with "Hierarchical Clustering On Principle Components (HCPC)". In the end of the analysis, p-values are computed by the HCPC function.
I searched but I couldn't find any function that could adjust the p-value based on FDR together with HCPC. It's really important to avoid any junk data in my multivariate set. Therefore my question is how can I run together with HCPC the p-value adjustment?
This is what I'm doing right now:
#install.packages(c("FactoMineR", "factoextra", "missMDA"))
library(ggplot2)
library(factoextra)
library(FactoMineR)
library(missMDA)
library(data.table)
MyData <- fread('https://drive.google.com/open?
id=1y1YbIXtUssEBqmMSEbiQGcoV5j2Bz31k')
row.names(MyData) <- MyData$ID
MyData [1] <- NULL
Mydata_frame <- data.frame(MyData)
# Compute PCA with ncp = 3 (Variate based on the cluster number)
Mydata_frame.pca <- PCA(Mydata_frame, ncp = 2, graph = FALSE)
# Compute hierarchical clustering on principal components
Mydata.hcpc <- HCPC(Mydata_frame.pca, graph = FALSE)
Mydata.hcpc$desc.var$quanti
v.test Mean in category
Overall mean sd in category Overall sd p.value
CD8RAnegDRpos 12.965378 -0.059993483
-0.3760962775 0.46726224 0.53192037 1.922798e-38
TregRAnegDRpos 12.892725 0.489753272
0.1381306362 0.46877083 0.59502553 4.946490e-38
mTregCCR6pos197neg195neg 12.829277 1.107851623
0.6495813704 0.48972987 0.77933283 1.124088e-37
CD8posCCR6neg183neg194neg 12.667318 1.741757598
1.1735140264 0.45260338 0.97870842 8.972977e-37
mTregCCR6neg197neg195neg 12.109074 1.044905184
0.6408258230 0.51417779 0.72804665 9.455537e-34
CD8CD8posCD4neg 11.306215 0.724115486
0.4320918842 0.49823677 0.56351333 1.222504e-29
CD8posCCR6pos183pos194neg 11.226390 -0.239967805
-0.4982954123 0.49454619 0.50203520 3.025904e-29
TconvRAnegDRpos 11.011114 -0.296585038
-0.5279707475 0.44863446 0.45846770 3.378002e-28

partial Distance Based RDA - Centroids vanished from Plot

I am trying to fir a partial db-RDA with field.ID to correct for the repeated measurements character of the samples. However including Condition(field.ID) leads to Disappearance of the centroids of the main factor of interest from the plot (left plot below).
The Design: 12 fields have been sampled for species data in two consecutive years, repeatedly. Additionally every year 3 samples from reference fields have been sampled. These three fields have been changed in the second year, due to unavailability of the former fields.
Additionally some environmental variables have been sampled (Nitrogen, Soil moisture, Temperature). Every field has an identifier (field.ID).
Using field.ID as Condition seem to erroneously remove the F1 factor. However using Sampling campaign (SC) as Condition does not. Is the latter the rigth way to correct for repeated measurments in partial db-RDA??
set.seed(1234)
df.exp <- data.frame(field.ID = factor(c(1:12,13,14,15,1:12,16,17,18)),
SC = factor(rep(c(1,2), each=15)),
F1 = factor(rep(rep(c("A","B","C","D","E"),each=3),2)),
Nitrogen = rnorm(30,mean=0.16, sd=0.07),
Temp = rnorm(30,mean=13.5, sd=3.9),
Moist = rnorm(30,mean=19.4, sd=5.8))
df.rsp <- data.frame(Spec1 = rpois(30, 5),
Spec2 = rpois(30,1),
Spec3 = rpois(30,4.5),
Spec4 = rpois(30,3),
Spec5 = rpois(30,7),
Spec6 = rpois(30,7),
Spec7 = rpois(30,5))
data=cbind(df.exp, df.rsp)
dbRDA <- capscale(df.rsp ~ F1 + Nitrogen + Temp + Moist + Condition(SC), df.exp); ordiplot(dbRDA)
dbRDA <- capscale(df.rsp ~ F1 + Nitrogen + Temp + Moist + Condition(field.ID), df.exp); ordiplot(dbRDA)
You partial out variation due to ID and then you try to explain variable aliased to this ID, but it was already partialled out. The key line in the printed output was this:
Some constraints were aliased because they were collinear (redundant)
And indeed, when you ask for details, you get
> alias(dbRDA, names=TRUE)
[1] "F1B" "F1C" "F1D" "F1E"
The F1? variables were constant within ID which already was partialled out, and nothing was left to explain.

Filtering by Column in Matlab for a list or variety of values

I've produced a code which separates data within a text file into the required format, filters the data and averages the output (in this case, the value in the fourth column)
I am trying to filter the data in column one for a list of values at the same time, with no strict pattern for the values. e.g 1001, 1007, 1048, 1192, 1200 ....
Currently my code only filters by a certain value (1001) is there a way of incorporating a list of values into this function?
C_f = C(C(:,1) == 1001 , :);
Any help would be much appreciated!
See if this is what you want,
val = [1000 1001];
ind = ismember(C(:,1),val);
C_f = C(ind,:)