Where in the sequence of a Probabilistic Suffix Tree does "e" occur? - markov-chains

In my data there are only missing data (*) on the right side of the sequences. That means that no sequence starts with * and no sequence has any other markers after *. Despite this the PST (Probabilistic Suffix Tree) seems to predict a 90% chance of starting with a *. Here's my code:
# Load libraries
library(RCurl)
library(TraMineR)
library(PST)
# Get data
x <- getURL("https://gist.githubusercontent.com/aronlindberg/08228977353bf6dc2edb3ec121f54a29/raw/c2539d06771317c5f4c8d3a2052a73fc485a09c6/challenge_level.csv")
data <- read.csv(text = x)
# Load and transform data
data <- read.table("thread_level.csv", sep = ",", header = F, stringsAsFactors = F)
# Create sequence object
data.seq <- seqdef(data[2:nrow(data),2:ncol(data)], missing = NA, right= NA, nr = "*")
# Make a tree
S1 <- pstree(data.seq, ymin = 0.05, L = 6, lik = TRUE, with.missing = TRUE)
# Look at first state
cmine(S1, pmin = 0, state = "N3", l = 1)
This generates:
[>] context: e
EX FA I1 I2 I3 N1 N2 N3 NR
S1 0.006821066 0.01107234 0.01218274 0.01208756 0.006821066 0.002569797 0.003299492 0.001554569 0.0161802
QU TR *
S1 0.01126269 0.006440355 0.9097081
How can the probability for * be 0.9097081 at the very beginning of the sequence, meaning after context e?
Does it mean that the context can appear anywhere inside a sequence, and that e denotes an arbitrary starting point somewhere inside a sequence?

A PST is a representation of a variable length Markov model (VLMC). As a classical Markov model a VLMC is assumed to be homogeneous (or stationary) meaning that the conditional probabilities of the outcome given the context are the same at each position in the sequence. In other words, the context can appear anywhere in the sequence. Actually, the search for contexts is done by exploring the tree that is supposed to apply anywhere in the sequences.
In your example, for l=1 (l is 1 + the length of the context), you look only for 0-length context, i.e., the only possible context is the empty sequence e. Your condition pmin=0, state=N3 (have a probability greater than 0 for N3) is equivalent to no condition at all. So you get the overall probability to observe each state. Because your sequences (with the missing states) are all of the same length, you would get the same results using TraMineR with
seqmeant(data.seq, with.missing=TRUE)/max(seqlength(data.seq))
To get the distribution at the first position, you can use TraMineR and look at the first column of the table of cross-sectional distributions at the successive positions returned by
seqstatd(data.seq, with.missing=TRUE)
Hope this helps.

Related

good hash function for a string of numbers and letters of the form "9X9XX99X9XX999999"

what would be a good hash code for a vehicle identification number, that is a string
of numbers and letters of the form "9X9XX99X9XX999999," where a "9" represents
a digit and an "X" represents a letter?
One reasonable approach is to hash the entire thing using a hash function suitable for strings, e.g. GCC's C++ Standard Library uses MURMUR32.
If you wanted to get more hands on, you could group all the digits to form one 11-digit number, and knowing the 6 letters can have 26 different values which is less than 2^5=32 - you could cheaply create a number from those letters (let's call them ABCDEF) by evaluating: A + B * 2^5 + C * 2^10 + D * 2^15 + E * 2^20 + F * 2^25
Then, separately hash both the 11-digit number and the number created from the letters with a decent hash function, and XOR or add the results; you'll have quite a good hash value for your VIN. I haven't personally evaluated it, but Thomas Mueller recommends and explains something ostensible suitable here:
uint64_t hash(uint64_t x) {
x = (x ^ (x >> 30)) * UINT64_C(0xbf58476d1ce4e5b9);
x = (x ^ (x >> 27)) * UINT64_C(0x94d049bb133111eb);
x = x ^ (x >> 31);
return x;
}

NFA to accept the following language

I need to build an NFA (or DFA) to recognize the following language:
L = {w | w mod 3 = 1}.
So the way I tried it was to make an NFA to recognize numbers divisible by 3 and then just add 1 to them, but this approach is a lot harder than it seems (if not impossible ?).
I only managed to do an NFA to recognize numbers divisible by 3.
I will assume that w is to be interpreted as the decimal representation (without leading zeroes) of a nonnegative integer.
Given this, we can use Myhill-Nerode to iteratively determine the states we need:
the empty string can be followed by any string in L to get to a string in L. We'll call the equivalence class for this [e]. Note that this equivalence class corresponds to the initial state of a minimal DFA for L (if one exists). Note also that the initial state is not accepting since the empty string is not a valid decimal representation of a nonnegative integer.
the string 0 cannot be followed by anything to get a string in L; it leads to a dead state corresponding to equivalence class [0].
strings 1, 4 and 7 are in L so they must correspond to a new state. We'll call the equivalence class for these [1].
strings 2, 5 and 8 are not in L; however, not all strings in L lead them to strings in L. These must correspond to a new equivalence class we'll call [2].
strings 3, 6 and 9 are not in L; but these can be followed by anything in L to get a string in L. This is the same as the empty string, so we don't need a new equivalence class or state: the equivalence class is [e].
it can be verified that every two-digit decimal string is indistinguishable from some one-digit decimal string above. so, no new equivalence classes or states are needed.
To determine the transitions, simply append the transition symbol to the equivalence class's representative element and see what equivalence class the resulting string belongs to: that will be where the transition terminates. For instance, there is a transition from [e] to [0] on 0, from [e] to [1] on 1, etc.
Because 10 = 1 (mod 3), adding a new digit to the end of a decimal string will cause the new value modulo 3 to be the sum of the original number's value modulo 3 with the value of the new digit modulo 3:
x = a (mod 3)
y = b (mod 3)
x * 10 = x * 1 (mod 3) since 10 = 1 (mod 3)
x . y = x * 10 + y = x * 1 + y = x + y (mod 3)
Filling in the transitions is left as an exercise.

Linear Regression with Apache Beam

How might one go about fitting a large number of linear regressions in a beam pipeline? I have a large csv, and I want to normalize every column (about 500) according to two columns A and B. That is, I would like to get standard residuals for X ~ A + B for each column in the csv X.
That's an interesting use case. You can do something like so:
INDEX_A = # Something
INDEX_B = # Something else
parsed_rows = pipeline | beam.ReadFromText(my_csv)
| beam.Map(parse_each_line)
def column_paired_rows(row):
for idx, val in row:
if idx in (INDEX_A, INDEX_B): continue
# Yield the values keyed with the independent + dependent variable indices
yield ((INDEX_A, idx), {'independent_var_value': row[INDEX_A],
'independent_var_idx': INDEX_A,
'dependent_var_value': val,
'dependent_var_idx': idx})
yield ((INDEX_B, idx), {'independent_var_value': row[INDEX_B],
'independent_var_idx': INDEX_B,
'dependent_var_value': val,
'dependent_var_idx': idx})
column_pairs = parsed_rows | beam.FlatMap(column_paired_rows) | beam.GroupByKey()
The column_pairs PCollection will group all your elements by independent, dependent variable pairs, and then you can run the analysis.
def perform_linear_regression(elm):
key = elm[0] # KEY is a tuple with (independent variable index, dependent variable index)
values = elm[1] # This is an iterable with the data points that you need.
pairs = [(v['independent_var_value'], v['dependent_var_value']) for v in values]
model = linear_regression(pairs)
return (key, model)
models = column_pairs | beam.Map(perform_linear_regression)
LMK if you'd like me to add further detail

Convert matlab symbol to array of products

Can I convert a symbol that is a product of products into an array of products?
I tried to do something like this:
syms A B C D;
D = A*B*C;
factor(D);
but it doesn't factor it out (mostly because that isn't what factor is designed to do).
ans =
A*B*C
I need it to work if A B or C is replaced with any arbitrarily complicated parenthesized function, and it would be nice to do it without knowing what variables are in the function.
For example (all variables are symbolic):
D = x*(x-1)*(cos(z) + n);
factoring_function(D);
should be:
[x, x-1, (cos(z) + n)]
It seems like a string parsing problem, but I'm not confident that I can convert back to symbolic variables afterwards (also, string parsing in matlab sounds really tedious).
Thank you!
Use regexp on the string to split based on *:
>> str = 'x*(x-1)*(cos(z) + n)';
>> factors_str = regexp(str, '\*', 'split')
factors_str =
'x' '(x-1)' '(cos(z) + n)'
The result factor_str is a cell array of strings. To convert to a cell array of sym objects, use
N = numel(factors_str);
factors = cell(1,N); %// each cell will hold a sym factor
for n = 1:N
factors{n} = sym(factors_str{n});
end
I ended up writing the code to do this in python using sympy. I think I'm going to port the matlab code over to python because it is a more preferred language for me. I'm not claiming this is fast, but it serves my purposes.
# Factors a sum of products function that is first order with respect to all symbolic variables
# into a reduced form using products of sums whenever possible.
# #params orig_exp A symbolic expression to be simplified
# #params depth Used to control indenting for printing
# #params verbose Whether to print or not
def factored(orig_exp, depth = 0, verbose = False):
# Prevents sympy from doing any additional factoring
exp = expand(orig_exp)
if verbose: tabs = '\t'*depth
terms = []
# Break up the added terms
while(exp != 0):
my_atoms = symvar(exp)
if verbose:
print tabs,"The expression is",exp
print tabs,my_atoms, len(my_atoms)
# There is nothing to sort out, only one term left
if len(my_atoms) <= 1:
terms.append((exp, 1))
break
(c,v) = collect_terms(exp, my_atoms[0])
# Makes sure it doesn't factor anything extra out
exp = expand(c[1])
if verbose:
print tabs, "Collecting", my_atoms[0], "terms."
print tabs,'Seperated terms with ',v[0], ', (',c[0],')'
# Factor the leftovers and recombine
c[0] = factored(c[0], depth + 1)
terms.append((v[0], c[0]))
# Combines trivial terms whenever possible
i=0
def termParser(thing): return str(thing[1])
terms = sorted(terms, key = termParser)
while i<len(terms)-1:
if equals(terms[i][1], terms[i+1][1]):
terms[i] = (terms[i][0]+terms[i+1][0], terms[i][1])
del terms[i+1]
else:
i += 1
recombine = sum([terms[i][0]*terms[i][1] for i in range(len(terms))])
return simplify(recombine, ratio = 1)

Best way to create generic/method consistency for sort.data.frame?

I've finally decided to put the sort.data.frame method that's floating around the internet into an R package. It just gets requested too much to be left to an ad hoc method of distribution.
However, it's written with arguments that make it incompatible with the generic sort function:
sort(x,decreasing,...)
sort.data.frame(form,dat)
If I change sort.data.frame to take decreasing as an argument as in sort.data.frame(form,decreasing,dat) and discard decreasing, then it loses its simplicity because you'll always have to specify dat= and can't really use positional arguments. If I add it to the end as in sort.data.frame(form,dat,decreasing), then the order doesn't match with the generic function. If I hope that decreasing gets caught up in the dots `sort.data.frame(form,dat,...), then when using position-based matching I believe the generic function will assign the second position to decreasing and it will get discarded. What's the best way to harmonize these two functions?
The full function is:
# Sort a data frame
sort.data.frame <- function(form,dat){
# Author: Kevin Wright
# http://tolstoy.newcastle.edu.au/R/help/04/09/4300.html
# Some ideas from Andy Liaw
# http://tolstoy.newcastle.edu.au/R/help/04/07/1076.html
# Use + for ascending, - for decending.
# Sorting is left to right in the formula
# Useage is either of the following:
# sort.data.frame(~Block-Variety,Oats)
# sort.data.frame(Oats,~-Variety+Block)
# If dat is the formula, then switch form and dat
if(inherits(dat,"formula")){
f=dat
dat=form
form=f
}
if(form[[1]] != "~") {
stop("Formula must be one-sided.")
}
# Make the formula into character and remove spaces
formc <- as.character(form[2])
formc <- gsub(" ","",formc)
# If the first character is not + or -, add +
if(!is.element(substring(formc,1,1),c("+","-"))) {
formc <- paste("+",formc,sep="")
}
# Extract the variables from the formula
vars <- unlist(strsplit(formc, "[\\+\\-]"))
vars <- vars[vars!=""] # Remove spurious "" terms
# Build a list of arguments to pass to "order" function
calllist <- list()
pos=1 # Position of + or -
for(i in 1:length(vars)){
varsign <- substring(formc,pos,pos)
pos <- pos+1+nchar(vars[i])
if(is.factor(dat[,vars[i]])){
if(varsign=="-")
calllist[[i]] <- -rank(dat[,vars[i]])
else
calllist[[i]] <- rank(dat[,vars[i]])
}
else {
if(varsign=="-")
calllist[[i]] <- -dat[,vars[i]]
else
calllist[[i]] <- dat[,vars[i]]
}
}
dat[do.call("order",calllist),]
}
Example:
library(datasets)
sort.data.frame(~len+dose,ToothGrowth)
Use the arrange function in plyr. It allows you to individually pick which variables should be in ascending and descending order:
arrange(ToothGrowth, len, dose)
arrange(ToothGrowth, desc(len), dose)
arrange(ToothGrowth, len, desc(dose))
arrange(ToothGrowth, desc(len), desc(dose))
It also has an elegant implementation:
arrange <- function (df, ...) {
ord <- eval(substitute(order(...)), df, parent.frame())
unrowname(df[ord, ])
}
And desc is just an ordinary function:
desc <- function (x) -xtfrm(x)
Reading the help for xtfrm is highly recommended if you're writing this sort of function.
There are a few problems there. sort.data.frame needs to have the same arguments as the generic, so at a minimum it needs to be
sort.data.frame(x, decreasing = FALSE, ...) {
....
}
To have dispatch work, the first argument needs to be the object dispatched on. So I would start with:
sort.data.frame(x, decreasing = FALSE, formula = ~ ., ...) {
....
}
where x is your dat, formula is your form, and we provide a default for formula to include everything. (I haven't studied your code in detail to see exactly what form represents.)
Of course, you don't need to specify decreasing in the call, so:
sort(ToothGrowth, formula = ~ len + dose)
would be how to call the function using the above specifications.
Otherwise, if you don't want sort.data.frame to be an S3 generic, call it something else and then you are free to have whatever arguments you want.
I agree with #Gavin that x must come first. I'd put the decreasing parameter after the formula though - since it probably isn't used that much, and hardly ever as a positional argument.
The formula argument would be used much more and therefore should be the second argument. I also strongly agree with #Gavin that it should be called formula, and not form.
sort.data.frame(x, formula = ~ ., decreasing = FALSE, ...) {
...
}
You might want to extend the decreasing argument to allow a logical vector where each TRUE/FALSE value corresponds to one column in the formula:
d <- data.frame(A=1:10, B=10:1)
sort(d, ~ A+B, decreasing=c(A=TRUE, B=FALSE)) # sort by decreasing A, increasing B