How should I incorporate State N as a weight variable? - linear-regression

My data structure is:
State N Var1 Var2
Alabama 23 54 42
Alaska 4 53 53
Arizona 53 75 65
Var1 and Var2 are aggregated percentage values at the state level. N is the number of participants in each state. I would like to run a linear regression between Var1 and Var2 with N as a weight. What is the best way to do this in SPSS?

You can either use WEIGHT BY or the REGWGT subcommand on REGRESSION; examples of both are below.
DATA LIST FREE / State (A15) N Var1 Var2 (3F2.0).
BEGIN DATA
Alabama 23 54 42
Alaska 4 53 53
Arizona 53 75 65
END DATA.
WEIGHT BY N.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT Var1
/METHOD=ENTER Var2.
*Or using subcommand.
WEIGHT OFF.
REGRESSION
/MISSING LISTWISE
/REGWGT=N
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT Var1
/METHOD=ENTER Var2.
You can verify for yourself that these two procedures produce the same estimates, and that running the regression without weighting (and without the REGWGT subcommand) results in different estimates for this example.

What is the reason for using N as a weight? If you do want to do that, you need to be careful about your degrees of freedom, since regression considers the weight to be a replication weight.
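Roughly speaking, WEIGHT BY N treats each state as N replicated cases, so in this example the weighted analysis acts as if there were 23 + 4 + 53 = 80 observations instead of 3; the error degrees of freedom then come out near 80 - 2 = 78 rather than 3 - 2 = 1, and the standard errors shrink accordingly.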

Related

Matlab fitrensemble capped to 99 observations

This is probably a very simple fix, but I've been trying to run a random forest algorithm on some data using the built-in fitrensemble method.
I pass in a 2D matrix X with the training data and the output vector Y.
These matrices have 3499 rows each; X has 6 columns and Y has 1.
Yet I get the following output from my fitrensemble call:
ResponseName: 'Y'
CategoricalPredictors: []
ResponseTransform: 'none'
NumObservations: 99
NumTrained: 100
Method: 'LSBoost'
LearnerNames: {'Tree'}
ReasonForTermination: 'Terminated normally after completing the requested number of training cycles.'
FitInfo: []
FitInfoDescription: 'None'
Regularization: []
FResample: 1
Replace: 1
UseObsForLearner: [99×100 logical]
NumObservations is always capped at 99... why is that? Note that when I reduce the training data to fewer than 99 rows, the value of `NumObservations` comes down to match it. I tried setting it as an argument, but that didn't work either.

Calculating the probability of success in k (or less) Bernoulli trials out of n using matlab

I am trying to calculate the probability of success in 70 (or fewer) Bernoulli trials out of 100. I wrote it in Matlab, but I get a probability of 1 (it can't be 1, since it is not success in all 100 trials).
Is my function OK?
syms k
f = nchoosek(100,k)*0.5^k*0.5^(100-k);
F = double(symsum(nchoosek(100,k)*0.5^k*0.5^(100-k),k,0,70));
If it is, how can I get a more accurate result in Matlab?
Thanks
edit:
I have a binary vector that represents success/failure in n trials (like tossing a coin 100 times), and I need the error of my sample (the way statistics does it... but I don't know statistics). So I thought that maybe I would try to calculate "how far am I from being correct in all trials", which should be 1-F in my code. But then 70 successes out of 100 gives me error = 0, which is obviously not true.
edit2: In the example I gave here, I need the probability that there are 70 successes in 100 trials.
You do have everything you need to answer this question.
In the formula you have posted, you sum the probabilities from 0 to 70; that is, it calculates the probability of having 0 or 1 or 2 ... or 70 successes, which means 70 or fewer successes.
Without the sum, you get the probability of having exactly k successes. The probability of getting exactly 70 successes is:
k = 70;
f = nchoosek(100,k)*0.5^k*0.5^(100-k)
Warning: Result may not be exact. Coefficient is greater than 9.007199e+15 and is only
accurate to 15 digits
> In nchoosek (line 92)
f =
2.3171e-05
You receive a warning that the computation of nchoosek(100,70) is not exact (see below for a better way).
To compute the probability of getting 70 or fewer successes, sum over the probabilities of getting 0 or 1 or ... 70 successes:
>> f = 0;
>> for k=0:70;
f = f + nchoosek(100,k)*.5^k*.5^(100-k);
end
You will receive a lot of warnings, but you can look at f:
>> f
f =
1.0000
As you see, rounded to four digits the probability is 1. We know, however, that it must be slightly less than one. If we ask Matlab to show more digits:
>> format long
we see that it is not exactly 1:
>> f
f =
0.999983919992352
If you compute 1-f, you will see that the result is not 0 (I switch back to showing fewer digits):
>> format short
>> 1-f
ans =
1.6080e-05
To get rid of the warnings and to simplify the code for computing the probabilities, Matlab provides several functions for the binomial distribution (in the Statistics and Machine Learning Toolbox). For the probability of getting exactly 70 successes, use
>> binopdf(70,100,.5)
ans =
2.3171e-05
and to get 70 or less successes:
>> format long
>> binocdf(70,100,.5)
ans =
0.999983919992352
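As for the 1-F quantity from your edit (the probability of more than 70 successes), it follows directly and matches the 1-f value computed above:
>> format short
>> 1 - binocdf(70,100,.5)
ans =
1.6080e-05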

Fast intersection of several interval ranges?

I have several variables, all of which are numeric ranges: (intervals in rows)
a = [ 1 4; 5 9; 11 15; 20 30];
b = [ 2 6; 12 14; 19 22];
c = [ 15 22; 24 29; 33 35];
d = [ 0 3; 15 17; 23 26];
(The values in my real dataset are not integers, but are represented here as such for clarity).
I would like to find intervals in which at least 3 of the variables intersect. In the above example [20 22] and [24 26] would be two such cases.
One way to approach this is to bin my values and add the bins together, but as my values are continuous, that'd create an 'edge effect', and I'd lose time binning the values in the first place. (binning my dataset at my desired resolution would create hundreds of GB of data).
Another approach, which doesn't involve binning, would be to compute pairwise intersections (let's call the result X) between all possible combinations of variables, and then intersect X with all the other variables, which is O(n^3).
What are your thoughts on this? Are there algorithms/libraries which have tools to solve this?
I was thinking of using sort of a geometric approach to solve this: Basically, if I considered that my intervals were segments in 1D space, then my desired output would be points where three segments (from three variables) intersect. I'm not sure if this is efficient algorithmically though. Advice?
O(N lg N) method (a minimal sketch in MATLAB follows below):
Convert each interval (t_A, t_B) to a pair of tagged endpoints ('begin', t_A), ('end', t_B).
Sort all the endpoints by time; this is the most expensive step.
Do one pass through, tracking the nesting depth (increment if the tag is 'begin', decrement if it is 'end'). This takes linear time.
When the depth changes from 2 to 3, it's the start of an output interval.
When it changes from 3 to 2, it's the end of an output interval.
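If it helps, here is a minimal MATLAB sketch of this sweep line, assuming the intervals are stored one per row as in the question and hard-coding the threshold of 3 (variable names are illustrative only):
a = [ 1 4; 5 9; 11 15; 20 30];
b = [ 2 6; 12 14; 19 22];
c = [ 15 22; 24 29; 33 35];
d = [ 0 3; 15 17; 23 26];
ivals  = [a; b; c; d];                        % pool all intervals, one per row
n      = size(ivals, 1);
events = [ivals(:,1), ones(n,1); ivals(:,2), -ones(n,1)];  % +1 at each 'begin', -1 at each 'end'
events = sortrows(events, [1 2]);             % sort by position; at ties, process ends before begins
depth  = cumsum(events(:,2));                 % nesting depth after each event
starts = events(events(:,2) ==  1 & depth == 3, 1);  % depth rises from 2 to 3
stops  = events(events(:,2) == -1 & depth == 2, 1);  % depth drops from 3 to 2
overlaps = [starts, stops]                    % regions covered by at least 3 variables
On the example data this returns [2 3; 20 22; 24 26]; note that [2 3] also qualifies, since a, b and d all cover it. Sorting with sortrows(events, [1 -2]) instead would count intervals that merely touch at a single point as overlapping.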

how to derive meaning/values from cluster analysis results

I am currently working on my Master's thesis, for which I would like to create a simulation model that simulates the walking behavior of older adults. To keep the simulation model simple, I want to form groups based on a cluster analysis, so that I can easily assign certain walking behavior to an older person depending on which group it belongs to (for example, if you belong to group 1, your walking time will be approximately 20 minutes).
However, I am not that familiar with cluster analysis. I have a big dataset containing many characteristics of the older adults (variables of both discrete and continuous nature); based on the literature, the following characteristics are currently used:
age, gender, scorehealth, education category, income category, occupation, socialnetwork, yes/no living in a pleasant neighbourhood, yes/no feeling safe in the neighbourhood, the distance to green, having a dog, and the walking time in minutes.
After using the daisy function and the silhouette method to determine the ideal number of clusters (and thus groups), I got my clusters. Now I am wondering how I should derive meaning from them. I found it difficult to use statistics such as means, since I am dealing with categories. So what can I do to derive useful meaning/statistical conclusions from each cluster group, such as: if you belong to cluster group 1, your income level is on average around income group 10, your age is around 70, and your walking time is around 20 minutes? Ideally I would also like to have the standard deviation of each variable in each cluster group, so I can easily use these values in my simulation model to assign certain walking behavior to older adults.
@Joy, you should first determine the relevant variables; this will also help with dimensionality reduction. Since you have not given a sample dataset to work with, I'm creating my own. Also note that before cluster analysis it's important to obtain clusters that are pure. By purity, I mean the cluster model should contain only those variables that account for the maximum variance in the data. The variables that show little to negligible variance are best removed, as they do not contribute to the cluster model. Once you have these (statistically) significant variables, cluster analysis will be meaningful.
Theoretical concepts
Clustering is a preprocessing algorithm, so it's imperative to derive statistically significant variables in order to extract pure clusters. Deriving these significant variables in a classification task is called feature selection, whereas in a clustering task it is done via Principal Components (PCs). Historically, PCs are known to work only for continuous variables. To derive PCs from categorical variables there is a method called Correspondence Analysis (CA), and for nominal categorical variables the method Multiple Correspondence Analysis (MCA) can be used.
Practical implementation
Let's create a data frame containing mixed variables (i.e. both categorical and continuous) like,
R> digits = 0:9
# set seed for reproducibility
R> set.seed(17)
# function to create random string
R> createRandString <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
R> df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
studLoc=sample(createRandString(10)),
finalmark=sample(c(0:100),10),
subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
)
R> str(df)
'data.frame': 10 obs. of 6 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10
$ name : Factor w/ 10 levels "a","b","c","d",..: 2 9 4 6 3 7 1 8 10 5
$ studLoc : Factor w/ 10 levels "APBQD6181U","GOSWE3283C",..: 5 3 7 9 2 1 8 10 4 6
$ finalmark: int 53 73 95 39 97 58 67 64 15 81
$ subj1mark: int 63 18 98 83 68 80 46 32 99 19
$ subj2mark: int 90 40 8 14 35 82 79 69 91 2
I will inject random missing values into the data so that it's more similar to real-world datasets.
# add random NA values
R> df<-as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))
R> colSums(is.na(df))
ID name studLoc finalmark subj1mark subj2mark
0 0 0 2 2 0
As you can see, the missing values are in the continuous variables finalmark and subj1mark. I chose median imputation over mean imputation because the median is more robust than the mean.
# Create a function to impute the missing values
R> ImputeMissing <- function(df){
# check if data frame
if(!(is.data.frame(df))){
df<- as.data.frame(df)
}
# Loop through the columns of the dataframe
for(i in seq_along(df))
{
if(class(df[,i]) %in% c("numeric","integer")){
# missing continuous data to be replaced by median
df[is.na(df[,i]),i] <- median(df[,i],na.rm = TRUE)
} # end inner-if
} # end for
return(df)
} # end function
# Remove the missing values
R> df.complete<- ImputeMissing(df)
# check missing values
R> colSums(is.na(df.complete))
ID name studLoc finalmark subj1mark subj2mark
0 0 0 0 0 0
Now we can apply the FAMD() method from the FactoMineR package to the cleaned dataset. You can type ??FactoMineR::FAMD in the R console to look at the documentation for this method. According to it, FAMD is a principal component method dedicated to exploring data with both continuous and categorical variables. It can be seen roughly as a mix between PCA and MCA. More precisely, the continuous variables are scaled to unit variance and the categorical variables are transformed into a disjunctive data table (crisp coding) and then scaled using the specific scaling of MCA. This balances the influence of both continuous and categorical variables in the analysis, meaning that both types of variables are on an equal footing when determining the dimensions of variability.
R> df.princomp <- FactoMineR::FAMD(df.complete, graph = FALSE)
Thereafter we can visualize the PCs using a scree plot (fig1), like so:
R> factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
barfill = "gray", barcolor = "black",
ylim = c(0, 50), xlab = "Principal Component",
ylab = "Percentage of explained variance",
main = "Principal Component (PC) for mixed variables")
A scree plot (as shown in fig1) is a simple line-segment plot that shows the fraction of the total variance in the data explained or represented by each Principal Component (PC). Here the first three PCs collectively account for 44.5% of the total variance. The question then naturally arises: which variables are behind these components?
To extract the contributions of the variables, I've used fviz_contrib(), shown in fig2:
R> factoextra::fviz_contrib(df.princomp, choice = "var",
axes = 1, top = 10, sort.val = c("desc"))
Fig2 above visualizes the contribution of rows/columns to the results of the Principal Component Analysis (PCA). From here I can see that the variables studLoc, name, subj2mark and finalmark are the most important and can be used for further analysis.
Now, you can proceed with cluster analysis.
# extract the important variables and store in a new dataframe
R> df.princomp.impvars<- df.complete[,c(2:3,6,4)]
# make the distance matrix
R> gower_dist <- cluster::daisy(df.princomp.impvars,
metric = "gower",
type = list(logratio = 3))
R> gower_mat <- as.matrix(gower_dist)
#make a hierarchical cluster model
R> model<-hclust(gower_dist)
#plotting the hierarchy
R> plot(model)
#cutting the tree at your decided level
R> clustmember<-cutree(model,3)
#adding the cluster member as a column to your data
R> df.clusters<-data.frame(df.princomp.impvars,cluster=clustmember)
R> df.clusters
name studLoc subj2mark finalmark cluster
1 b POTYQ0002N 90 53 1
2 i LWMTW1195I 40 73 1
3 d VTUGO1685F 8 95 2
4 f YCGGS5755N 14 70 1
5 c GOSWE3283C 35 97 2
6 g APBQD6181U 82 58 1
7 a VUJOG1460V 79 67 1
8 h YXOGP1897F 69 64 1
9 j NFUOB6042V 91 70 1
10 e QYTHG0783G 2 81 3
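From here, cluster-wise summaries of the kind asked about (per-cluster means, standard deviations, and category frequencies) can be obtained by splitting df.clusters on the cluster column, for example with base R's aggregate() for the numeric columns and table() for the factor columns.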

How to write vectorization code for 2 matrix

I have two matrices like this:
gt = [30 40 20 40] and
de = [32 42 20 40; 34 12 20 40; 36 84 20 40]
I want to calculate the overlap area between gt and each of the 3 rows of de, where the overlap is computed by a function I wrote myself. Then I want to store the results in a new column vector like
result = [result1; result2; result3].
Could you tell me how to write vectorized code to achieve this?
Thanks!
The vectorization can only happen inside the overlap function. The only thing you can do outside it is replicate the vector gt, using repmat or bsxfun. You don't explain how the overlap function works; I suppose it has to do with coordinates, so I give an example for Euclidean distance, which works with similar logic.
If you had to calculate the distance between point gt = [1 2] and points de = [5 6; 10 12; 0 -1] you would define
function result = dist(x, y)
    % row-wise Euclidean distance between the 2-D points in x and y
    result = sqrt((x(:,1) - y(:,1)).^2 + (x(:,2) - y(:,2)).^2);
end
and you would call it replicating the gt vector
dist(de, repmat(gt, 3, 1))
Alternatively, you could use bsxfun instead of repmat, which might have better performance (depending on various factors); a sketch of that is below.
The key to vectorizing is performing operations column-wise. (In this specific case it could be vectorized even further; however, I am writing it this way to emphasize the column-wise operations.)
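A minimal sketch of the bsxfun variant mentioned above, using the same Euclidean-distance example (from MATLAB R2016b on, de - gt expands implicitly, so bsxfun is not strictly needed):
gt = [1 2];
de = [5 6; 10 12; 0 -1];
diffs  = bsxfun(@minus, de, gt);   % expand gt across the rows of de
result = sqrt(sum(diffs.^2, 2))    % column vector of row-wise distances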