Possible to write Typology into dataset? - traminer

I am working with TraMineR and I am new to R and TraMineR.
Actually I made a typology of a life course dataset with TraMineR and the cluster library in R.
(used this guide: http://traminer.unige.ch/preview-typology.shtml)
Now I have different Cases sorted into different Types (all in all 4 Types).
I want to get into deeper analysis with a certain Type but I need to know which cases ( I have case numbers) belong to which type.
#
Is it possible to write the certain type a case is sorted to into the dataset itself as a new variable Is there another way?

In the example of the referenced guide, the Type is obtained as follows using an optimal matching distances with substitution costs based on transition probabilities
library(TraMineR)
data(mvad)
mvad.seq <- seqdef(mvad, 17:86)
dist.om1 <- seqdist(mvad.seq, method = "OM", indel = 1, sm = "TRATE")
library(cluster)
clusterward1 <- agnes(dist.om1, diss = TRUE, method = "ward")
cl1.4 <- cutree(clusterward1, k = 4)
cl.4 is a vector with the cluster membership of the sequences in the order corresponding to the mvad dataset. (It could be convenient to transform it into a factor.) Therefore, you can simply add this variable as an additional column to the dataset. For instance, if we want to name this new column as Type
mvad$Type <- cl1.4
tail(mvad[,c("id","Type")]) ## id and Type of the last 6 sequences
## id Type
## 707 707 3
## 708 708 3
## 709 709 4
## 710 710 2
## 711 711 1
## 712 712 4

Related

kdb - how to create sum a list of dynamic columns using functional select

I want to be able to construct (+; (+; `a; `b); `c) given a list of `a`b`c
Similarly if I have a list of `a`b`c`d, I want to be able to construct another nest and so on and so fourth.
I've been trying to use scan but I cant get it right
q)fsum:(+;;)/
enlist[+;;]/
q)fsum `a`b`c`d
+
(+;(+;`a;`b);`c)
`d
If you only want the raw parse tree output, one way is to form the equivalent string and use parse. This isn't recommended for more complex examples, but in this case it is clear.
{parse "+" sv string x}[`a`b`c`d]
+
`d
(+;`c;(+;`b;`a))
If you are looking to use this in a functional select, we can use +/ instead of adding each column individually, like how you specified in your example
q)parse"+/[(a;b;c;d)]"
(/;+)
(enlist;`a;`b;`c;`d)
q)f:{[t;c] ?[t;();0b;enlist[`res]!enlist (+/;(enlist,c))]};
q)t:([]a:1 2 3;b:4 5 6;c:7 8 9;d:10 11 12)
q)f[t;`a`b`c]
res
---
12
15
18
q)f[t;`a`b]
res
---
5
7
9
q)f[t;`a`b`c]~?[t;();0b;enlist[`res]!enlist (+;(+;`a;`b);`c)]
1b
You can also get the sum by indexing directly to return a list of each column values and sum over these. We use (), to turn any input into a list, otherwise it will sum the values in that single column and return only a single value
q)f:{[t;c] sum t (),c}
q)f[t;`a`b`c]
12 15 18

Phylogenetic model using multiple entries for each species

I am relatively new to phylogenetic regression models. In the past I used PGLS when I had only 1 entry for each species in my tree. Now I have a dataset with thousands of records for a total of 9 species and I would like to run a phylogenetic model. I read the tutorial of the most common packages (e.g. caper) but I am unsure how to build the model.
When I try to create the object for caper, i.e. using:
obj <- comparative.data(phy = Tree, data = Data, names.col = species, vcv = TRUE, na.omit = FALSE, warn.dropped = TRUE)
I get the message:
Error in row.names<-.data.frame(*tmp*, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘Species1’, ‘Species2’, ‘Species3’, ‘Species4’, ‘Species5’, ‘Species6’, ‘Species7’, ‘Species8’, ‘Species9’
I understood that I may solve this by applying a MCMCglmm model but I am unfamiliar with Bayesian models.
Thanks in advance for your help.
This is indeed not going to work with a simple PGLS from caper because it cannot deal with individuals as a random effect. I suggest you use MCMCglmm that is not much more complex to understand and will allow you to have individuals as a random effect. You can find excellent documentation from the package's author here or here or an alternative documentation that's more dealing with some specific aspects of the package (namely tree uncertainty) here.
Really briefly to get you going:
## Your comparative data
comp_data <- comparative.data(phy = my_tree, data =my_data,
names.col = species, vcv = TRUE)
Note that you can have a specimen column that can look like this:
taxa var1 var2 specimen
1 A 0.08730689 a spec1
2 B 0.47092692 a spec1
3 C -0.26302706 b spec1
4 D 0.95807782 b spec1
5 E 2.71590217 b spec1
6 A -0.40752058 a spec2
7 B -1.37192856 a spec2
8 C 0.30634567 b spec2
9 D -0.49828379 b spec2
10 E 1.42722363 b spec2
You can then set up your formula (similar to a simple lm formula):
## Your formula
my_formula <- variable1 ~ variable2
And your MCMC settings:
## Setting the prior list (see the MCMCglmm course notes for details)
prior <- list(R = list(V=1, nu=0.002),
G = list(G1 = list(V=1, nu=0.002)))
## Setting the MCMC parameters
## Number of interactions
nitt <- 12000
## Length of burnin
burnin <- 2000
## Amount of thinning
thin <- 5
And you should then be able to run a default MCMCglmm:
## Extracting the comparative data
mcmc_data <- comp_data$data
## As MCMCglmm requires a column named animal for it to identify it as a phylo
## model we include an extra column with the species names in it.
mcmc_data <- cbind(animal = rownames(mcmc_data), mcmc_data)
mcmc_tree <- comp_data$phy
## The MCMCglmmm
mod_mcmc <- MCMCglmm(fixed = my_formula,
random = ~ animal + specimen,
family = "gaussian",
pedigree = mcmc_tree,
data = mcmc_data,
nitt = nitt,
burnin = burnin,
thin = thin,
prior = prior)

kdb+/q: Apply iterative procedure with updated variable to a column

Consider the following procedure f:{[x] ..} with starting value a:0:
Do something with x and a. The output is saved as the new version of a, and the output is returned by the function
For the next input x, redo the procedure but now with the new a.
For a single value x, this procedure is easily constructed. For example:
a:0;
f:{[x] a::a+x; :a} / A simple example (actual function more complicated)
However, how do I make such a function such that it also works when applied on a table column?
I am clueless how to incorporate this step for 'intermediate saving of a variable' in a function that can be applied on a column at once. Is there a special technique for this? E.g. when I use a table column in the example above, it will simply calculate a+x with a:0 for all rows, opposed to also updating a at each iteration.
No need to use global vars for this - can use scan instead - see here.
Example --
Generate a table -
q)t:0N!([] time:5?.z.p; sym:5?`3; price:5?100f; size:5?10000)
time sym price size
-----------------------------------------------
2002.04.04D18:06:07.889113280 cmj 29.07093 3994
2007.05.21D04:26:13.021438816 llm 7.347808 496
2010.10.30D10:15:14.157553088 obp 31.59526 1728
2005.11.01D21:15:54.022395584 dhc 34.10485 5486
2005.03.06D21:05:07.403334368 mho 86.17972 2318
Example with a simple accumilator - note, the function has access to the other args if needed (see next example):
q)update someCol:{[a;x;y;z] (a+1)}\[0;time;price;size] from t
time sym price size someCol
-------------------------------------------------------
2002.04.04D18:06:07.889113280 cmj 29.07093 3994 1
2007.05.21D04:26:13.021438816 llm 7.347808 496 2
2010.10.30D10:15:14.157553088 obp 31.59526 1728 3
2005.11.01D21:15:54.022395584 dhc 34.10485 5486 4
2005.03.06D21:05:07.403334368 mho 86.17972 2318 5
Say you wanted to get cumilative size:
q)update cuSize:{[a;x;y;z] (a+z)}\[0;time;price;size] from t
time sym price size cuSize
------------------------------------------------------
2002.04.04D18:06:07.889113280 cmj 29.07093 3994 3994
2007.05.21D04:26:13.021438816 llm 7.347808 496 4490
2010.10.30D10:15:14.157553088 obp 31.59526 1728 6218
2005.11.01D21:15:54.022395584 dhc 34.10485 5486 11704
2005.03.06D21:05:07.403334368 mho 86.17972 2318 14022
If you wanted more than one var passed through the scan, can pack more values into the first var, by giving it a more complex structure:
q)update cuPriceAndSize:{[a;x;y;z] (a[0]+y;a[1]+z)}\[0 0;time;price;size] from t
time sym price size cuPriceAndSize
--------------------------------------------------------------
2002.04.04D18:06:07.889113280 cmj 29.07093 3994 29.07093 3994
2007.05.21D04:26:13.021438816 llm 7.347808 496 36.41874 4490
2010.10.30D10:15:14.157553088 obp 31.59526 1728 68.014 6218
2005.11.01D21:15:54.022395584 dhc 34.10485 5486 102.1188 11704
2005.03.06D21:05:07.403334368 mho 86.17972 2318 188.2986 14022
#MdSalih solution is correct, I am just explaining here what could be the possible reason with global variable in your case and solution for that.
q) t:([]id: 1 2)
q)a:1
I think you might have been using it like this:
q) select k:{x:x+a;a::a+1;:x} id from t
output:
k
--
1
2
And a value is 2 which means function executed only once. Reason is we passed full id column list to function and (+) is atomic which means it operates on full list at once. In following ex. 2 will get added to all items in list.
q) 2 + (1;3;5)
Correct way to use it is 'each':
q)select k:{x:x+a;a::a+1;:x} each id from t
output:
k
--
2
3

Delete rows from a cell given a specific condition

I have a cell type big-variable sorted out by FIRM (A(:,2)) and I want to erase all the rows in which the same firm doesn't appear at least 3 times in a row. In this example, A:
FIRM
1997 'ABDR' 0,56 464 1641 19970224
1997 'ABDR' 0,65 229 9208 19970424
1997 'ABDR' 0,55 125 31867 19970218
1997 'ABD' 0,06 435 8077 19970311
1997 'ABD' 0,00 150 44994 19970804
1997 'ABFI' 2,07 154 46532 19971209
I would keep only A:
1997 'ABDR' 0,56 464 1641 19970224
1997 'ABDR' 0,65 229 9208 19970424
1997 'ABDR' 0,55 125 31867 19970218
Thanks a lot.
Notes:
I used fopen and textscanto import the csv file.
I performed some changes on some variables for all of them to fit in a cell-type variable
I converted some number-elements into stings
F_x=num2cell(Data{:,x});
I got new variable just with year
F_ya=max(0,fix(log10(F_y)+1)-4);
F_yb=fix(F_y./10.^F_ya);
F_yc = num2cell(F_yb);
Create new cell A w/ variables I need
A=[F_5C Data{:,1} Data{:,2} Data{:,3} Data{:,4} F_xa F_xb];
Meaning that within the cell I have some variables that are strings and others that are numbers.
I'm going to assume that your names are stored in a cell array. As such, your names would actually be:
names = {'ABDR', 'ABDR', 'ABDR', 'ABD', 'ABD', 'ABFI'};
We can then use strcmpi. What this function does is that it string compares two strings together. It returns true if the strings match and false otherwise. This is also case insensitive, so ABDR would be the same as abdr.
You would call strcmpi like so:
v = strcmpi(str1, str2);
Alternatively str2 can be a cell array. How this would work is that it would take a single string str1 and compare with each string in each cell of the cell array. It would then return a logical vector that is the same size as str2 which indicates whether we have a match at this particular location or not.
As such, we can go through each element of names and see how many matches we have overall with the entire names cell array. We can then figure out which locations we need to select by checking to see if we have at least 3 matches or more per name in the names array. In other words, we simply sum up the logical vector for each string within names and filter those that sum up to 3 or more. We can use cellfun to help us perform this. As such:
sums = cellfun(#(x) sum(strcmpi(x,names)), names);
Doing this thus gives:
sums =
3 3 3 2 2 1
Now, we need those locations that have three or more. As such:
locations = sums >= 3
locations =
1 1 1 0 0 0
As such, these are the rows that you can use to filter out your matrix. This is also a logical vector. Assuming that A contains your data, you would simply do A(locations,:) to filter out all those rows that have occurrences of three or more times for a particular name. I really don't know how you constructed A, so I'm assuming it's like a 2D matrix. If you put in the code that you used to construct this matrix, I'll modify my post to get it working for you. In any case, what's important is locations. This tells you what rows you need to select to match your criteria.

Change the class of columns in a data frame

First of all, excuse me if I do any mistakes, but English is not a language I use very often.
I have a data frame with numbers. A small part of the data frame is this:
nominal 2
2
2
2
ordinal
2
1
1
2
So, I want to use the gower distance function on these numbers.
Here ( http://rgm2.lab.nig.ac.jp/RGM2/R_man-2.9.0/library/StatMatch/man/gower.dist.html ) says that in order to use gower.dist, all nominal variables must be of class "factor" and all ordinal variables of class "ordered".
By default, all the columns are of class "integer" and mode "numeric". In order to change the class of the columns, i use these commands:
DF=read.table("clipboard",header=TRUE,sep="\t")
# I select all the cells and I copy them to the clipboard.
#Then R, with this command, reads the data from there.
MyHeader=names(DF) # I save the headers of the data frame to a temp matrix
for (i in 1:length(DF)) {
if (MyHeader[[i]]=="nominal") DF[[i]]=as.factor(DF[[i]])
}
for (i in 1:length(DF)) {
if (MyHeader[[i]]=="ordinal") DF[[i]]=as.ordered(DF[[i]])
}
The first for/if loop changes the class from integer to factor, which is what I want, but the second changes the class of ordinal variables to: "ordered" "factor".
I need to change all the columns with the header "ordinal" to "ordered", as the gower.dist function says.
Thanks in advance,
B.T.
What you are doing is fine --- if perhaps a little inelegantly.
With your ordered factor, you have something like:
> foo <- as.ordered(1:10)
> foo
[1] 1 2 3 4 5 6 7 8 9 10
Levels: 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9 < 10
> class(foo)
[1] "ordered" "factor"
Notice that it has two classes, indicating that it is an ordered factor and that is is a factor:
> is.ordered(as.ordered(1:10))
[1] TRUE
> is.factor(as.ordered(1:10))
[1] TRUE
In some senses, you might like to think that foo is an ordered factor but also inherits from the factor class too. Alternatively, if there isn't a specific method that handles ordered factors, but there is a method for factors, R will use the factor method. As far as R is concerned, an ordered factor is an object with classes "ordered" and "factor". This is what your function for Gower's distance will require.
You could easily do this with:
DF$nominal <- as.factor(DF$nominal)
DF$ordinal <- as.ordered(DF$ordinal)
which gives you a dataframe with the correct structure. If you work with data frames, please stay away from [[]] unless you know very well what you're doing. Take Dirks advice, and check Owen's R Guide as well. You definitely need it.
If i do the conversion as I showed above, gower.dist() works perfectly fine. On a sidenote, the gowers distance can easily be calculated using the daisy() function as well:
DF <- data.frame(
ordinal= c(1,2,3,1,2,1),
nominal= c(2,2,2,2,2,2)
)
DF$nominal <- as.factor(DF$nominal)
DF$ordinal <- as.ordered(DF$ordinal)
library(cluster)
daisy(DF,metric="gower")
library(StatMatch)
gower.dist(DF)