Bootstrap method resampling in Matlab

I am writing a script to create random bootstrap samples from a precipitation data set (for the sskt and Kendall tau package in Matlab).
I have one double array with 3 columns from my data:
the first is the year, the second is an index for the station (or season/period), and the third is the precipitation of that station (I run this method to get a regional trend).
1970 1 234
1971 1 244
1972 1 344
... ... ...
1970 2 342
1971 2 356
... ... ...
etc. I have 36 years for each of my stations (12 stations, so 12 x 36 = 432 rows with 3 columns).
I want an m-file in which I can call the function sskt for N = 5000 repetitions of my data. My data comes from a CSV file and is stored as a double matrix in Matlab. I want a bootstrap method that resamples each column and generates 5000 (or 1000) repetitions; 1000 repetitions means 1000 x 36 = 36000 resampled rows. Inside the loop of 1000 I call the function sskt, and as results I get 1000 S slopes, 1000 Kendall tau values, and 1000 significance values.
Does anyone have an idea?

Matlab has a built-in bootstrap function for this called bootstrp (Statistics Toolbox). It draws N bootstrap data samples, computes your statistic on each sample, and returns the results.
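For example, a minimal sketch of how you might call it, assuming your 36-by-3 station matrix is stored in data and that your sskt function returns its three results as a single row vector [S, tau, sig] (adjust the wrapper to the real sskt signature):

N = 1000;                                         % number of bootstrap repetitions
% ASSUMPTION: sskt(block) returns [S, tau, sig] for a 36-by-3 block of data
results = bootstrp(N, @(block) sskt(block), data);
% bootstrp resamples the rows of 'data' with replacement N times and applies
% the function handle to each resampled block, so 'results' is N-by-3 here:
S_boot   = results(:, 1);   % N bootstrap S slopes
tau_boot = results(:, 2);   % N bootstrap Kendall tau values
sig_boot = results(:, 3);   % N bootstrap significance values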


Split dataset into a specific number of samples with a for loop

I have a 35136-by-1 matrix containing power data for 366 days, with every day having 96 measurements. I want to take 252 overlapping samples of 7 days each: the power data of day 1 to day 7 is the first sample, the power data of day 2 to day 8 is the second sample, etc., and reshape my matrix to size [96 7 1 252].
I wrote the following code, but I get 36 samples instead of 252:
m=7;
for j=1
    sample([j:96*m],:)=solarpower_n([j:96*m],:);
    y([(96*m)+1:96*(m+1)],:)=solarpower_n([(96*m)+1:96*(m+1)],:);
    m=m+1;
    for j=2:246
        sample([(96*(j-1))+1:96*m],:)=solarpower_n([(96*(j-1))+1:96*m],:);
        y([(96*m)+1:96*(m+1)],:)=solarpower_n([(96*m)+1:96*(m+1)],:);
        m=m+1;
    end
end
I want to take one sample from each window of 7 consecutive days. Let D be the day index and M the measurement index within a day, so M = 1,2,...,96 and D = 1,2,...,252. The power of the first day, P1, therefore has dimensions 96-by-1. I want sample1 = {P1,...,P7}, sample2 = {P2,...,P8}, ..., sample252 = {P252,...,P258}, collected in a [96 7 1 252] 4-D array.
How can I accomplish this?
Taking samples that way is rather inefficient, since you're copying each data point 7 times. You could simply use indexing:
A = rand(96*366, 1); % Sample data
B = reshape(A,[96 366]); % Reshape all your days in one go
B(:, 1:7) % first 7 days
B(:, 163:170) % Days 163 to 170, etc.
If you do want to copy your data seven times to your 4D array you can use a simple for loop:
A = rand(96*366, 1); % Sample data
% Note you also need days 253:258, since sample 252 runs from day 252 to day 258
B = reshape(A(1:96*(252+6)),[96 (252+6)]); % Reshape your first 258 days
C = zeros(size(B,1), 7, 1, size(B,2)-6); % Initialise output
for ii = 1:size(B,2)-6
C(:, :, :, ii) = B(:, ii:ii+6); % Save each 7 day sample
end
Getting rid of the for loop is difficult, given you want a sliding window. There are probably specialised functions for that somewhere, but given your data size a loop should be sufficiently performant.
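If you do want to avoid the loop anyway, one sketch (assuming R2016b or newer for implicit expansion; on older releases use bsxfun(@plus, ...) instead) is to build all sliding-window day indices as one matrix and index B with it in a single go:

A = rand(96*366, 1);                    % Sample data
B = reshape(A(1:96*258), [96 258]);     % First 258 days, as above
idx = (0:251)' + (1:7);                 % 252-by-7 matrix of day indices (implicit expansion)
C = reshape(B(:, idx'), [96 7 1 252]);  % columns are picked sample by sample, then reshaped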
For a short introduction on reshape() you can read this answer of mine.

how to derive meaning/values from cluster analysis results

For my Master's thesis I would like to create a simulation model of the walking behavior of older adults. To keep the model simple, I want to form groups based on a cluster analysis so I can assign a certain walking behavior to an older person who belongs to a certain group (for example, if you belong to group 1, your walking time will be approximately 20 minutes).
However, I am not that familiar with cluster analysis. I have a large dataset with many characteristics of the older adults (variables of both discrete and continuous nature); based on the literature, the following characteristics are currently used:
age, gender, health score, education category, income category, occupation, social network, yes/no living in a pleasant neighbourhood, yes/no feeling safe in the neighbourhood, distance to green space, having a dog, and walking time in minutes.
After using the daisy function and the silhouette method to choose the ideal number of clusters (and thus groups), I got my clusters. Now I am wondering how I should derive meaning from them. I find it difficult to use statistics such as means, since I am also dealing with categories. What can I do to derive useful statistical conclusions for each cluster group, such as: if you belong to cluster group 1, your income level is on average around income group 10, your age is around 70, and your walking time is around 20 minutes? Ideally I would also like the standard deviation of each variable in each cluster group,
so that I can easily use these values in my simulation model to assign walking behavior to older adults.
@Joy, you should first determine the relevant variables; this will also help with dimensionality reduction. Since you've not given a sample dataset to work with, I'm creating my own. Also note that, before cluster analysis, it's important to obtain clusters that are pure. By purity I mean the clusters should be built only from those variables that account for the maximum variance in the data. Variables that show little to negligible variance are best removed, since they contribute nothing to a cluster model. Once you have these (statistically) significant variables, cluster analysis will be meaningful.
Theoretical concepts
Clustering is a preprocessing algorithm, and it's imperative to derive statistically significant variables in order to extract pure clusters. Deriving these significant variables in a classification task is called feature selection, whereas in a clustering task the derived variables are called Principal Components (PCs). Historically, PCs only work for continuous variables. To derive components from categorical variables there is a method called Correspondence Analysis (CA), and for nominal categorical variables Multiple Correspondence Analysis (MCA) can be used.
Practical implementation
Let's create a data frame containing mixed variables (i.e. both categorical and continuous) like,
R> digits = 0:9
# set seed for reproducibility
R> set.seed(17)
# function to create random string
R> createRandString <- function(n = 5000) {
       a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
       paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
   }
R> df <- data.frame(ID = c(1:10), name = sample(letters[1:10]),
                    studLoc = sample(createRandString(10)),
                    finalmark = sample(c(0:100), 10),
                    subj1mark = sample(c(0:100), 10), subj2mark = sample(c(0:100), 10))
R> str(df)
'data.frame': 10 obs. of 6 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10
$ name : Factor w/ 10 levels "a","b","c","d",..: 2 9 4 6 3 7 1 8 10 5
$ studLoc : Factor w/ 10 levels "APBQD6181U","GOSWE3283C",..: 5 3 7 9 2 1 8 10 4 6
$ finalmark: int 53 73 95 39 97 58 67 64 15 81
$ subj1mark: int 63 18 98 83 68 80 46 32 99 19
$ subj2mark: int 90 40 8 14 35 82 79 69 91 2
I will inject random missing values into the data so that it's more similar to real-world datasets.
# add random NA values
R> df<-as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))
R> colSums(is.na(df))
ID name studLoc finalmark subj1mark subj2mark
0 0 0 2 2 0
As you can see, the missing values are in the continuous variables finalmark and subj1mark. I chose median imputation over mean imputation because the median is more robust than the mean.
# Create a function to impute the missing values
R> ImputeMissing <- function(data = df){
     # coerce to a data frame if needed
     if(!is.data.frame(data)){
       data <- as.data.frame(data)
     }
     # loop through the columns of the data frame
     for(i in seq_along(data)){
       if(class(data[,i]) %in% c("numeric","integer")){
         # missing continuous data is replaced by the column median
         data[is.na(data[,i]), i] <- median(data[,i], na.rm = TRUE)
       } # end if
     } # end for
     return(data)
   } # end function
# Remove the missing values
R> df.complete<- ImputeMissing(df)
# check missing values
R> colSums(is.na(df.complete))
ID name studLoc finalmark subj1mark subj2mark
0 0 0 0 0 0
Now we can apply the method FAMD() from the FactoMineR package to the cleaned dataset. You can type ??FactoMineR::FAMD in the R console to look at the documentation for this method. From the documentation: FAMD is a principal component method dedicated to exploring data with both continuous and categorical variables. It can be seen roughly as a mix between PCA and MCA. More precisely, the continuous variables are scaled to unit variance and the categorical variables are transformed into a disjunctive data table (crisp coding) and then scaled using the specific scaling of MCA. This balances the influence of both continuous and categorical variables in the analysis, so that both kinds of variables are on an equal footing when determining the dimensions of variability.
R> df.princomp <- FactoMineR::FAMD(df.complete, graph = FALSE)
Thereafter we can visualize the PCs using a screeplot shown in fig1 like,
R> factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
barfill = "gray", barcolor = "black",
ylim = c(0, 50), xlab = "Principal Component",
ylab = "Percentage of explained variance",
main = "Principal Component (PC) for mixed variables")
A Scree Plot (as shown in fig1) is a simple line segment plot that shows the fraction of total variance in the data as explained or represented by each Principal Component (PC). So we can see the first three PCs collectively are responsible for 44.5% of total variance. The question now naturally arises, "What are these variables?".
To extract the contribution of the variables, I've used fviz_contrib() shown in fig2 like,
R> factoextra::fviz_contrib(df.princomp, choice = "var",
axes = 1, top = 10, sort.val = c("desc"))
Fig2 above visualizes the contribution of the rows/columns to the results of the principal component analysis. From it I can see that the variables studLoc, name, subj2mark and finalmark are the most important and can be used for further analysis.
Now, you can proceed with cluster analysis.
# extract the important variables and store in a new dataframe
R> df.princomp.impvars<- df.complete[,c(2:3,6,4)]
# make the distance matrix
R> gower_dist <- cluster::daisy(df.princomp.impvars,
metric = "gower",
type = list(logratio = 3))
R> gower_mat <- as.matrix(gower_dist)
#make a hierarchical cluster model
R> model<-hclust(gower_dist)
#plotting the hierarchy
R> plot(model)
#cutting the tree at your decided level
R> clustmember<-cutree(model,3)
#adding the cluster member as a column to your data
R> df.clusters<-data.frame(df.princomp.impvars,cluster=clustmember)
R> df.clusters
name studLoc subj2mark finalmark cluster
1 b POTYQ0002N 90 53 1
2 i LWMTW1195I 40 73 1
3 d VTUGO1685F 8 95 2
4 f YCGGS5755N 14 70 1
5 c GOSWE3283C 35 97 2
6 g APBQD6181U 82 58 1
7 a VUJOG1460V 79 67 1
8 h YXOGP1897F 69 64 1
9 j NFUOB6042V 91 70 1
10 e QYTHG0783G 2 81 3

Matlab: spatial average in a 4d matrix (time, case, x, y)

Here is my dataset:
pressure(time, case, x, y)
>> size(pressure)
ans =
100 1 289 570
How can I get a spatial nanmean of pressure for x from 30 to 60 and y from 40 to 70 at each time step?
For example: a nanmean value for that particular region for each timestep from time 1 to time 100.
I tried spatial_mean_pressure = nanmean(pressure(:,:,30:60,40:70)), but it averaged the pressure over the time series. This is not the result I want.
>> size(spatial_mean_pressure)
ans =
1 1 31 31
I would like to get results like this:
>> size(spatial_mean_pressure)
ans =
100 1 1 1
You are trying to get a mean over an entire block of the matrix, so you should apply nanmean twice, not once, and along a particular dimension each time. I think this is what you want:
x=randi(10,[100 1 10 25]);
First take the mean along the third dimension.
mean_x_3=nanmean(x,3);
You would get an answer of size = [100 1 1 25]. Then take the mean along 4th dimension.
mean_x_4=nanmean(mean_x_3,4);
This should give you the desired answer. You can write this in one line as,
mean_x = nanmean(nanmean(x,3),4);
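As a side note, on newer Matlab releases (this assumes R2018b or later) you can do it in one call by passing a vector of dimensions and omitting NaNs; applied to the region in the question, that would look like:

% One-call alternative (R2018b+): average over dimensions 3 and 4, ignoring NaNs
spatial_mean_pressure = mean(pressure(:, :, 30:60, 40:70), [3 4], 'omitnan');
size(spatial_mean_pressure)   % [100 1]: one value per time step (trailing singleton dimensions are dropped)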

Understanding Histogram in Matlab

I got the following results after applying [h,bins] = hist(data), where data contains the LBP (Local Binary Pattern) values.
h =
   221    20     6     4     1     1     2     0     0     1
bins =
    8.2500   24.7500   41.2500   57.7500   74.2500   90.7500  107.2500  123.7500  140.2500  156.7500
I want to ask the following:
Does the first bin represent the values 0-8.25 and the second bin the values 8.26-24.75, and so forth?
For the h value 221, does it mean that 221 of the computed LBP values lie in the range 0-8.25?
1) No. The bin location is the center value of the bin; that is, the first bin covers the values 0-16.5, the second bin 16.5-33, etc. Use histc if it is more natural to specify bin edges instead of centers.
2) h(1) = 221 means that from your entire data set (which has 256 elements according to your question), 221 elements have values between 0 and 16.5.
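For example, a small sketch of specifying the edges yourself (the edge values here are only an assumption matching the 16.5 bin width above; adjust them to your LBP range):

edges = 0:16.5:165;            % explicit bin edges 0, 16.5, ..., 165
counts = histc(data, edges);   % counts(k) = number of values in [edges(k), edges(k+1)); the last entry counts values equal to 165
% On newer Matlab releases, histcounts(data, edges) is the recommended replacement for histc.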

Brain teaser - filtering algorithm using moving averages

I have a dataset of 86400 wind speed (WS) values sampled at 1-second intervals in Matlab and need assistance in filtering it. It requires a certain level of cleverness.
If the average WS exceeds:
25 m/s in a 600 s time interval,
28 m/s in a 30 s time interval, or
30 m/s in a 3 s time interval,
then the WS is deemed 'invalid' until the average WS remains below 22 m/s over a 300 s time interval.
Here is what I have for the 600-second requirement. I compute 600- and 300-second moving averages on the data contained in 'dataset'. I flag the interval from the first appearance of an average above 25 m/s to the next appearance of an average below 22 m/s as NaN. After filtering, I will do another 600-second average, and the intervals flagged with NaN will stay NaN.
i.e.
Rolling600avg(:,1) = tsmovavg(dataset(:,2), 's', 600, 1);
Rolling300avg(:,1) = tsmovavg(dataset(:,2), 's', 300, 1);
a = find(Rolling600avg(:,2)>25)
b = find(Rolling300avg(:,2)<22)
dataset(a:b(a:find(b==1)),2)==NaN; %?? Not sure
This is going to require a clever use of 'find' and some indexing. Could someone help me out? The 28m/s and 30m/s filters will follow the same method.
If I follow your question, one approach is to use a for loop to identify where the NaNs should begin and end.
m = [19 19 19 19 28 28 19 19 28 28 17 17 17 19 29 18 18 29 18 29]; % Example data
a = find(m>25);
b = find(m<22);
m2 = m;
% Use a loop to isolate segments that should be NaNs
for ii = 1:length(a)
    firstNull = a(ii);
    lastNull = b( find(b>firstNull,1) ) - 1; % first index in b greater than a(ii)
    % if there is no such value then NaNs should fill to the end of the vector
    if isempty(lastNull)
        lastNull = length(m);
    end
    m2(firstNull:lastNull) = NaN;
end
Note that this only works if tsmovavg returns a vector of the same length as the one passed to it. If not, it's trickier and will require some modifications.
There's probably some way of avoiding the for loop, but this is a pretty straightforward solution.
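To connect this back to the actual thresholds, here is a sketch of how the same loop could be driven by the three moving averages, using movmean with a trailing window (R2016a or later; the variable names and the reading of the 300 s recovery rule are assumptions):

ws = dataset(:,2);                              % 1 Hz wind speed, one value per second
avg600 = movmean(ws, [599 0]);                  % trailing 600 s average
avg30  = movmean(ws, [29 0]);                   % trailing 30 s average
avg3   = movmean(ws, [2 0]);                    % trailing 3 s average
avg300 = movmean(ws, [299 0]);                  % trailing 300 s average

a = find(avg600 > 25 | avg30 > 28 | avg3 > 30); % samples that trigger 'invalid'
b = find(avg300 < 22);                          % samples where the data is considered valid again

ws_filtered = ws;
for ii = 1:length(a)
    firstNull = a(ii);
    lastNull = b(find(b > firstNull, 1)) - 1;   % last invalid sample before recovery
    if isempty(lastNull)
        lastNull = length(ws);                  % no recovery: invalid to the end of the record
    end
    ws_filtered(firstNull:lastNull) = NaN;
end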