Logistic Regression(Classification Technique) on Time-dependent Predictors/variables Data - classification

I would like to know if I can apply the classification techniques, like say Logistic Regression, to data whose variables/predictors are 'indexed' by time. Or if not, what classification techniques are appropriate to use in these kinds of data.
To give you a clear picture of the problem, say I have a dependent variable Y, whose values are 0 or 1 (for binary case classification), or 1,2,3,... (for 'multi' classification).
And I have predictor variables which are 'indexed' by time, i.e., X1T1, X1T2,...,X1Tn, X2T1, X2T2,..., X2Tm,....XpTk,
where
X1T1 = values of variable X1 at time 1 (T1)
X1T2 = values of variable X1 at time 2 (T2)
.
.
X1Tn = values of variable X1 at time n (Tn)
X2T1 = values of variable X2 at time 1 (T1)
X2T2 = values of variable X2 at time 2 (T2)
.
.
X2Tm = values of variable X2 at time m (Tm)
.
.
.
XpTk = values of variable Xp at time k (Tk)
where n,m,k = 1,2,... (variable time 'index')
p =1,2,.... (# of predictor variables).
For the data view, we have;
Obs Y X1T1 X1T2 ... X1Tn X2T1 X2T3 ... X2Tm ... XpTk
1 . . . . . . . .
2 . . . . . . . ... .
.
.
.
N . . . . . . . ... .
Can I apply a classification technique, like say, Logistic Regression on these types of data (or other classification techniques for 'multi' category response variable like tree based methods.) Thanks a lot!

You could use your data to fit Logistic Regression model.
However, the result may not so good because the algorithm would consider each variable independently.
One way to use your data to fit LR model is to make new variables which are okay to be considered that they are independent to each other and keep enough information to represent your original data.
For example)
newvar1 = mean(X1T1, X1T2, ..., X1Tn),
newvar2 = sd(X1T1, X1T2, ..., X1Tn),
newvar3 = mean increasing ratio(X1T1, X1T2, ..., X1Tn)
...
This way, you can use your data to fit LR model, even though the examples of new variables would not be enough.

Related

Nnet in caret, basic structure

I'm very new to caret package and nnet in R. I've done some projects related to ANN with Matlab before, but now I need to work with R and I need some basic help.
My input dataset has 1000 observations (in rows) and 23 variables (in columns). My output has 1000 observations and 12 variables.
Here are some sample data that represent my dataset and might help to understand my problem better:
input = as.data.frame(matrix(sample(1 : 20, 100, replace = TRUE), ncol = 10))
colnames(input) = paste ( "X" , 1:10, sep = "") #10 observations and 10 variables
output = as.data.frame(matrix(sample(1 : 20, 70, replace = TRUE), ncol = 7))
colnames(output) = paste ( "Y" , 1:7, sep = "") #10 observations and 7 variables
#nnet with caret:
net1 = train(output ~., data = input, method= "nnet", maxit = 1000)
When I run the code, I get this error:
error: invalid type (list) for variable 'output'.
I think I have to add all output variables separately (which is very annoying, especially with a lot of variables), like this:
train(output$Y1 + output$Y2 + output$Y3 + output$Y4 + output$Y5 +
output$Y6 + output$Y7 ~., data = input, method= "nnet", maxit = 1000)
This time it runs but I get this error:
Error in [.data.frame(data, , all.vars(Terms), drop = FALSE) :
undefined columns selected
I try to use neuralnet package, with the code below it works perfectly but I still have to add output variables separately :(
net1 = neuralnet(output$Y1 + output$Y2 + output$Y3 + output$Y4 +
output$Y5 + output$Y6 + output$Y7 ~., data = input, hidden=c(2,10))
p.s. since these sample data are created randomly, the neuralnet cannot converge, but in my real data it works well (in comparison to Matlab ANN)
Now, if you could help me with a way to put output variables automatically (not manually), it solves my problem (although with neuralnet not caret).
use the str() function and ascertain that its a data frame looks like you are inputting a list to the train function. This may be because of a transformation you are doing before to output.
str(output)
Without a full script of earlier steps its difficult to understand what is going on.
After trying different things and searches, I finally found a solution:
First, we must use as.formula to show the relation between our input and output. With the code below we don't need to add all the variables separately:
names1 <- colnames(output) #the name of our variables in the output
names2 = colnames(input) #the name of our variables in the input
a <- as.formula(paste(paste(names1,collapse='+', sep = ""),' ~ '
,paste(names2,collapse='+', sep = "")))
then we have to combine our input and output in a single data frame:
all_data = cbind(output, input)
then, use neuralnet like this:
net1 = neuralnet(formula = a, data = all_data, hidden=c(2,10))
plot(net1)
This is also work with the caret package:
net1 = train(a, data = all_data, method= "nnet", maxit = 1000)
but it seems neuralnet works faster (at least in my case).
I hope this helps someone else.

Workaround for dynamically generating a structure - Matlab/Octave issues

I have many, many sets of data in .csv format that I've organized by a file name standard so I can use regular expressions for the second time ever. I have, however, run into a slight problem. My data files are titled things like, "2012001_C335_2000MHZ_P_1111.CSV". There are four years of interest, two frequencies, and four different C335-style labels to describe locations. I have a significant amount of data processing to do on each of these files, so I'd like to read them all into one giant structure and then do my processing on different parts of it. I'm writing:
for ix_id = 1:length(ids)
for ix_years = 1:2:length(ids_years{ix_id})
for ix_frq = 1:length(frqs)
st = [ids_years{ix_id}{ix_year} '_' ids{ix_id} '_' ids_frqs{ix_id}{ix_frq}'_P_1111.CSV'];
data.(ids_frqs{ix_id}{ix_frq}).(ids{ix_id}).(['Y' ids_years{ix_id}{ix_year}]) =...
dlmread(st);
end
end
end
All ids variables are 1x4 cell arrays where each cell contains strings.
This produces the errors:
"Error: a cs-list cannot be further indexed"
and
"Error: invalid assignment to cs-list outside multiple assignment"
I did an internet search for these errors and found a few posts with dates ranging from 2010 to 2012, such as this one and this one, where the author suggests it's a problem with Octave itself. I can do a workaround that involves defining two separate structures by removing the innermost for loop over ix_frq and replacing the lines beginning with "st" and "data" with
data.1500.(ids{ix_id}).(['Y' ids_years{ix_id}{ix_year}]) = ...
dlmread([ids_years{ix_id}{ix_year} '_' ids{ix_id} '_' ids{ix_id} '_1500MHZ_P_1111.CSV']);
data.2000.(ids{ix_id}).(['Y' ids_years{ix_id}{ix_year}]) = ...
dlmread([ids_years{ix_id}{ix_year} '_' ids{ix_id} '_' ids{ix_id} '_2000MHZ_P_1111.CSV']);
so it seems that the trouble arises when I try to make a more nested structure. I'm wondering if this is unique to Octave or the same in Matlab, and also if there's a slicker workaround than defining two separate structures since I'd like this to be as portable as possible. If you have any insight as to the meaning of the error message, I'm interested in that too. Thanks!
EDIT: Here is the full script - now generates a few dummy .csv files. Runs on Octave v. 3.8
clear all
%this program tests the creation of various structures. The end goal is to have a structure of the format frequency.beamname.year(1) = matrix of the appropriate file
A = rand(3,2);
csvwrite('2009103_C115_1500MHZ.CSV',A)
csvwrite('2009103_C115_2000MHZ.CSV',A)
csvwrite('2010087_C115_1500MHZ.CSV',A)
csvwrite('2010087_C115_2000MHZ.CSV',A)
csvwrite('2009103_C335_1500MHZ.CSV',A)
csvwrite('2009103_C335_2000MHZ.CSV',A)
csvwrite('2010087_C335_1500MHZ.CSV',A)
csvwrite('2010087_C335_2000MHZ.CSV',A)
data = dir('*.CSV'); %imports all of the files of a directory
files = {data.name}; %cell array of filenames
nfiles = numel(files);
%find all the years
years = unique(cellfun(#(x)x{1},regexp(files,'\d{7}','match'),'UniformOutput',false));
%find all the beam names
ids = unique(cellfun(#(x)x{1},regexp(files,'([C-I]\d{3})|([C-I]\d{1}[C-I]\d{2})','match'),'UniformOutput',false));
%find all the frequencies
frqs = unique(cellfun(#(x)x{1},regexp(files,'\d{4}MHZ','match'),'UniformOutput',false));
%now, vectorize to cover all the beams
for id_ix = 1:length(ids)
expression_yrs = ['(\d{7})(?=_' ids{id_ix} ')'];
listl_yrs = regexp(files,expression_yrs,'match');
ids_years{id_ix} = unique(cellfun(#(x)x{1},listl_yrs(cellfun(#(x)~isempty(x),listl_yrs)),'UniformOutput',false)); %returns the years for data collected with both the 1500 and 2000 MHZ antennas along each of thebeams
expression_frqs = ['(?<=' ids{id_ix} '_)(\d{4}MHZ)'];
listfrq = regexp(files,expression_frqs,'match'); %finds every frequency that was collected for C115, C335
ids_frqs{id_ix} = unique(cellfun(#(x)x{1},listfrq(cellfun(#(x)~isempty(x),listfrq)),'UniformOutput',false));
end
%% finally, dynamically generate a structure data.Beam.Year.Frequency
%this works
for ix_id = 1:length(ids)
for ix_year = 1:length(ids_years{ix_id})
data1500.(ids{ix_id}).(['Y' ids_years{ix_id}{ix_year}])=dlmread([ids_years{ix_id}{ix_year} '_' ids{ix_id} '_' ids_frqs{1}{1} '.CSV']);
data2000.(ids{ix_id}).(['Y' ids_years{ix_id}{ix_year}])=dlmread([ids_years{ix_id}{ix_year} '_' ids{ix_id} '_' ids_frqs{1}{2} '.CSV']);
end
end
%this doesn't work
for ix_id=1:length(ids)
for ix_year=1:length(ids_years{ix_id})
for ix_frq = 1:numel(frqs)
data.(['F' ids_frqs{ix_id}{ix_frq}]).(ids{ix_id}).(['Y' ids_years{ix_id}{ix_year}])=dlmread([ids_years{ix_id}{ix_year} '_' ids{ix_id} '_' ids_frqs{ix_id}{ix_frq} '.CSV']);
end
end
end
Hopefully, that helps clarify the question - I am not sure of the etiquette here with posting edits and code.
The problem is that when you get to the for loop that is causing a problem, data already exists and is a struct array.
octave> data
data =
8x1 struct array containing the fields:
name
date
bytes
isdir
datenum
statinfo
When you select a field from a struct array you will get a cs-list (comma-separate list) unless you also index which of the structs in the struct array. See:
octave> data.name
ans = 2009103_C115_1500MHZ.CSV
ans = 2009103_C115_2000MHZ.CSV
ans = 2009103_C335_1500MHZ.CSV
ans = 2009103_C335_2000MHZ.CSV
ans = 2010087_C115_1500MHZ.CSV
ans = 2010087_C115_2000MHZ.CSV
ans = 2010087_C335_1500MHZ.CSV
ans = 2010087_C335_2000MHZ.CSV
octave> data(1).name
ans = 2009103_C115_1500MHZ.CSV
So when you do:
data.(...) = dlmread (...);
you don't get what you were expecting on the left hand side, you will get a cs-list. But I'm guessing this is accidental, since data at the moment only has filenames, so simply create a new empty struct:
data = struct (); # this will clear your previous data
for ix_id=1:length(ids)
for ix_year=1:length(ids_years{ix_id})
for ix_frq = 1:numel(frqs)
data.(['F' ids_frqs{ix_id}{ix_frq}]).(ids{ix_id}).(['Y' ids_years{ix_id}{ix_year}])=dlmread([ids_years{ix_id}{ix_year} '_' ids{ix_id} '_' ids_frqs{ix_id}{ix_frq} '.CSV']);
end
end
end
I would also recommend to think better about your current solution. This code looks overcomplicated to me.

Convert matlab symbol to array of products

Can I convert a symbol that is a product of products into an array of products?
I tried to do something like this:
syms A B C D;
D = A*B*C;
factor(D);
but it doesn't factor it out (mostly because that isn't what factor is designed to do).
ans =
A*B*C
I need it to work if A B or C is replaced with any arbitrarily complicated parenthesized function, and it would be nice to do it without knowing what variables are in the function.
For example (all variables are symbolic):
D = x*(x-1)*(cos(z) + n);
factoring_function(D);
should be:
[x, x-1, (cos(z) + n)]
It seems like a string parsing problem, but I'm not confident that I can convert back to symbolic variables afterwards (also, string parsing in matlab sounds really tedious).
Thank you!
Use regexp on the string to split based on *:
>> str = 'x*(x-1)*(cos(z) + n)';
>> factors_str = regexp(str, '\*', 'split')
factors_str =
'x' '(x-1)' '(cos(z) + n)'
The result factor_str is a cell array of strings. To convert to a cell array of sym objects, use
N = numel(factors_str);
factors = cell(1,N); %// each cell will hold a sym factor
for n = 1:N
factors{n} = sym(factors_str{n});
end
I ended up writing the code to do this in python using sympy. I think I'm going to port the matlab code over to python because it is a more preferred language for me. I'm not claiming this is fast, but it serves my purposes.
# Factors a sum of products function that is first order with respect to all symbolic variables
# into a reduced form using products of sums whenever possible.
# #params orig_exp A symbolic expression to be simplified
# #params depth Used to control indenting for printing
# #params verbose Whether to print or not
def factored(orig_exp, depth = 0, verbose = False):
# Prevents sympy from doing any additional factoring
exp = expand(orig_exp)
if verbose: tabs = '\t'*depth
terms = []
# Break up the added terms
while(exp != 0):
my_atoms = symvar(exp)
if verbose:
print tabs,"The expression is",exp
print tabs,my_atoms, len(my_atoms)
# There is nothing to sort out, only one term left
if len(my_atoms) <= 1:
terms.append((exp, 1))
break
(c,v) = collect_terms(exp, my_atoms[0])
# Makes sure it doesn't factor anything extra out
exp = expand(c[1])
if verbose:
print tabs, "Collecting", my_atoms[0], "terms."
print tabs,'Seperated terms with ',v[0], ', (',c[0],')'
# Factor the leftovers and recombine
c[0] = factored(c[0], depth + 1)
terms.append((v[0], c[0]))
# Combines trivial terms whenever possible
i=0
def termParser(thing): return str(thing[1])
terms = sorted(terms, key = termParser)
while i<len(terms)-1:
if equals(terms[i][1], terms[i+1][1]):
terms[i] = (terms[i][0]+terms[i+1][0], terms[i][1])
del terms[i+1]
else:
i += 1
recombine = sum([terms[i][0]*terms[i][1] for i in range(len(terms))])
return simplify(recombine, ratio = 1)

Print the name of a variable on upon a plot/figure

Is it possible to refer back to/access the names of variables (say nx1 arrays) that make up a matrix? I wish to access them to insert there names into a plot or figure (as a text) that I have created. Here is an example:
A = [supdamp, clgvlv,redamp,extfanstat,htgvlv,occupied,supfanspd]
%lots of code here but not changing A, just using A(:,:)'s
%drawn figure
text(1,1,'supdamp')
...
text(1,n,'supfanspd')
I have failed in an attempt create a string named a with their names in so that I could loop through a(i,1), then use something like text(1,n,'a(i,1)')
Depending on your problem, it might make sense to use structures with dynamical field names.
Especially if your data in the array have some meaning other than just entries of a matrix in linear algebra sense.
# name your variables so that your grandma could understand what they store
A.('supdamp') = supdamp
A.('clgvlv') = clgvlv
...
fieldsOfA = fieldnames(a)
for n = 1 : numel(fieldsOfA )
text(1, n, fieldsOfA{n})
end

Bootstrapping Stepwise Regression in Stata

I'm trying to bootstrap a stepwise regression in Stata and extract the bootstrapped coefficients. I have two separate ado files. sw_pbs is the command the user uses, which calls the helper command sw_pbs_simulator.
program define sw_pbs, rclass
syntax varlist, [reps(integer 100)]
simulate _b, reps(`reps') : sw_pbs_simulator `varlist'
end
program define sw_pbs_simulator, rclass
syntax varlist
local depvar : word 1 of `varlist'
local indepvar : list varlist - depvar
reg `depvar' `indepvar'
local rmse = e(rmse)
matrix b_matrix = e(b)'
gen col_of_ones = 1
mkmat `indepvar' col_of_ones, mat(x_matrix)
gen errs = rnormal(0, `rmse')
mkmat errs, mat(e_matrix)
matrix y = x_matrix * b_matrix + e_matrix
svmat y
sw reg y `indepvar', pr(0.10) pe(0.05)
drop col_of_ones errs y
end
The output is a data set of the bootstrapped coefficients. My problem is that the output seems to be dependent on the result of the first stepwise regression simulation. For example if I had the independent variables var1 var2 var3 var4 and the first stepwise simulation includes only var1 and var2 in the model, then only var1 and var2 will appear in subsequent models. If the first simulation includes var1 var2 and var3 then only var1 var2 and var3 will appear in subsequent models, assuming that they are significant (if not their coefficients will appear as dots).
For example, the incorrect output is featured below. The variables lweight, age, lbph, svi, gleason and pgg45 never appear if they do not appear in the first simulation.
_b_lweight _b_age _b_lbph _b_svi _b_lcp _b_gleason _b_pgg45 _b_lpsa
.4064831 .5390302
.2298697 .5591789
.2829061 .6279869
.5384691 .6027049
.3157105 .5523808
I want coefficients that are not included in the model to always appear as dots in the data set and I want subsequent simulations to not be seemingly dependent on the first simulation.
By using _b as a short-cut, the first iteration defined which coefficients were to be stored by simulate in all subsequent iterations. That is fine for most simulation programs, as those would use a fixed set of coefficients, but not what you want to use in combination with sw. So I adapted the program to explicitly list the coefficients (possibly missing when not selected) that are to be stored.
I also changed your programs such that they will run faster by avoiding mkmat and svmat and replacing those computations with predict and generate. I also changed it to make it fit more with conventions in the Stata community that a command will only replace a dataset in memory after a user explicitly asks for it by specifying the clear option. Finally I made sure that names of variables and scalars created in the program do not conflict with names already present in memory by using tempvar and tempname. These will also be automatically deleted when the program ends.
clear all
program define sw_pbs, rclass
syntax varlist, clear [reps(integer 100)]
gettoken depvar indepvar : varlist
foreach var of local indepvar {
local res "`res' `var'=r(`var')"
}
simulate `res', reps(`reps') : sw_pbs_simulator `varlist'
end
program define sw_pbs_simulator, rclass
syntax varlist
tempname rmse b
tempvar yhat y
gettoken depvar indepvar : varlist
reg `depvar' `indepvar'
scalar `rmse' = e(rmse)
predict double `yhat' if e(sample)
gen double `y' = `yhat' + rnormal(0, `rmse')
sw reg `y' `indepvar', pr(0.10) pe(0.05)
// start returning coefficients
matrix `b' = e(b)
local in : colnames `b'
local out : list indepvar - in
foreach var of local in {
return scalar `var' = _b[`var']
}
foreach var of local out {
return scalar `var' = .
}
end