Interpreting time series dimension? - matlab

I am wondering if anyone can explain the interpretation of the size (number of feature) in a time series? For example consider a simple script in Matlab
X= randn(2,5,2)
X(:,:,1) =
-0.5530 0.4291 0.3937 -1.2534 0.2811
-1.4926 -0.7019 -0.8305 -1.4034 1.9545
X(:,:,2) =
0.2004 0.1438 2.3655 -0.1589 0.7140
0.4905 0.2301 -0.7813 -0.6737 0.2552
Assume X is a time series with the following output
This generates 2 vectors of length 5 each has 2 rows. Can anyone tell me what is exactly the meaning of first 2 and 5?
In some websites it says a creating 5 vectors of length 5 and size 2. What does size mean here?
Is 2 like number of features and 5 is like number of time series. The reason for this confusion is because I do not understand how to interpret following sentence:
"Generate 2 vector-valued sequences of length 5; each vector has size
2."
What do size 2 and length 5 mean here?

This entirely depends on your data, and how you want to store this. If you have some 2D data over time, I find it convenient to have a data matrix with in the 1st and 2nd dimension the 2D data per time step, and in the 3rd dimension time.
Say I have a movie of 1920 by 1080 pixels with 100 frames, I'd store this as mov = rand(1080,1920,100) (1080 and 1920 swapped because of row, col order of indexing). So now mov(:,:,1) would give me the first frame etc.
BTW, your X is a normal array, not to be confused with the timeseries object.

Related

Splitting Vector into Sub-vectors MATLAB

I am trying to figure out how to split a vector in matlab into subvectors.
I am solving a differential equation numerically using dde23. When You do this the length of the solution vector changes. Thus, I am finding it not so easy to use the mat2cell command that many people suggest.
All I am trying to do is split (as evenly as possible) a vector of length N into an arbitrary amount of sub-vectors whose length may vary depending on the length of the time vector. I am doing this so then I can find the maximum value of each vector on in each interval.
If I understand the question, maybe you can try to split it by using code below
dataset=[ 1 2 3 4 5 6 7 8 9 10]
splitpoint = randi[2 length(dataset)-1]
subset1 = dataset(1,1:splitpoint)
splitpoint = randi[length(subset1)+1 length(dataset)-1]
subset2 = dataset(1,length(subset1)+1:splitpoint)
After that you can choose where to finish and accept rest of it for last subset or you can define one list to hold each subset in the row of the list. So you can define while loop to handle it automatically by defining stop_criteria.

Matlab Randi lab Vectors

Im doing a lab using matlab and have hit a bit of a snag. The prompt is:
a. Generate a vector to manipulate in the following exercises by using a random
number generator to create "pull-ups" counts for 50 people. The counts should be
from 1 to 10. Use this vector of counts for the next two problems.
b. How many people did more than 5 pull-ups? Do your results make sense for a
uniformly distributed random number generator?
c. Generate another vector for "pull-ups" counts 50 athletes, so this time use the
range from 11 to 20. Append this new vector to the previous vector (now you have
100 "pull-ups" counts).
d. Find the average number of "pull-ups" for the 100 total people. Do your results
make sense?
e. Use the 100 person vector in c and create a new vector that contains only the
counts from the odd-numbered índices (not the odd value counts, instead the
counts for every other person starting with person 1).
f. Use the 100 person vector in c and make a new vector of the "even-valued
counts".
Now, I can do parts a. and b. with no problem, but i do not have an idea on how to do part c. Ive been trying to do this
x=randi(20,11,50)
now i know that i get 110 values that range from 1 to 20 doing what i put above. But im trying to get 50 values from 11 to 20 and add those values to the vector in part a so that i have 100 values, with 50 ranging from 1-10 and the other 50 ranging from 11-20. Any idea what im doing wrong?
You need to provide an array as the first input to randi to specify the lower and upper limits of the random integers. If you specify just a scalar, then values between 1 and your provided values will be returned. The second and third inputs are the size of the output so we want the output to be 50 x 1
x = randi([11 20], 50, 1)

Unreasonable [positive] log-likelihood values from matlab "fitgmdist" function

I want to fit a data sets with Gaussian mixture model, the data sets contains about 120k samples and each sample has about 130 dimensions. When I use matlab to do it, so I run scripts (with cluster number 1000):
gm = fitgmdist(data, 1000, 'Options', statset('Display', 'iter'), 'RegularizationValue', 0.01);
I get the following outputs:
iter log-likelihood
1 -6.66298e+07
2 -1.87763e+07
3 -5.00384e+06
4 -1.11863e+06
5 299767
6 985834
7 1.39525e+06
8 1.70956e+06
9 1.94637e+06
The log likelihood is bigger than 0! I think it's unreasonable, and don't know why.
Could somebody help me?
First of all, it is not a problem of how large your dataset is.
Here is some code that produces similar results with a quite small dataset:
options = statset('Display', 'iter');
x = ones(5,2) + (rand(5,2)-0.5)/1000;
fitgmdist(x,1,'Options',options);
this produces
iter log-likelihood
1 64.4731
2 73.4987
3 73.4987
Of course you know that the log function (the natural logarithm) has a range from -inf to +inf. I guess your problem is that you think the input to the log (i.e. the aposteriori function) should be bounded by [0,1]. Well, the aposteriori function is a pdf function, which means that its value can be very large for very dense dataset.
PDFs must be positive (which is why we can use the log on them) and must integrate to 1. But they are not bounded by [0,1].
You can verify this by reducing the density in the above code
x = ones(5,2) + (rand(5,2)-0.5)/1;
fitgmdist(x,1,'Options',options);
this produces
iter log-likelihood
1 -8.99083
2 -3.06465
3 -3.06465
So, I would rather assume that your dataset contains several duplicate (or very close) values.

Graphing different sets of data on same graph within a ‘for’ loop MATLAB

I just have a problem with graphing different plots on the same graph within a ‘for’ loop. I hope someone can be point me in the right direction.
I have a 2-D array, with discrete chunks of data in and amongst zeros. My data is the following:
A=
0 0
0 0
0 0
3 9
4 10
5 11
6 12
0 0
0 0
0 0
0 0
7 9.7
8 9.8
9 9.9
0 0
0 0
A chunk of data is defined as contiguous set of data, without interruptions of a [0 0] row. So in this example, the 1st chunk of data would be
3 9
4 10
5 11
6 12
And 2nd chunk is
7 9.7
8 9.8
9 9.9
The first column is x and second column is y. I would like to plot y as a function of x (x is horizontal axis, y is vertical axis) I want to plot these data sets on the same graph as a scatter graph, and put a line of best fit through the points, whenever I come across a chunk of data. In this case, I will have 2 sets of points and 2 lines of best fit (because I have 2 chunks of data). I would also like to calculate the R-squared value
The code that I have so far is shown below:
fh1 = figure;
hold all;
ah1 = gca;
% plot graphs:
for d = 1:max_number_zeros+num_rows
if sequence_holder(d,1)==0
continue;
end
c = d;
while sequence_holder(c,1)~=0
plot(ah1,sequence_holder(c,1),sequence_holder(c,num_cols),'*');
%lsline;
c =c+1;
continue;
end
end
Sequence holder is the array with the data in it. I can only plot the first set of data, with no line of best fit. I tried lsline, but that didn't work.
Can anyone tell me how to
-plot both sets of graphs
-how to draw a line of best fit a get the regression coefficient?
The first part could be done in a number of ways. I would test the second column for zeroness
zerodata = A(:,2) == 0;
which will give you a logical array of ones and zeros like [1 1 1 0 1 0 0 ...]. Then you can use this to split up your input. You could look at the diff of that array and test it for positive or negative sign. Your data starts on 0 so you won't get a transition for that one, so you'd need to think of some way to deal with that or the opposite case, unless you know for certain that it will always be one way or the other. You could just test the first element, or you could insert a known value at the start of your input array.
You will then have to store your chunks. As they may be of variable length and variable number you wouldn't put them into a big matrix, but you still want to be able to use a loop. I would use either a cell array, where each cell in a row contains the x or y data for a chunk, or a struct array where say structarray(1).x and structarray)1).y hold your data values.
Then you can iterate through your struct array and call plot on each chunk separately.
As for fitting you can use the fit command. It's complex and has lots of options so you should check out the help first (type doc fit inside the console to get the inline help, which is the same as the website help in content). The short version is that you can do a simple linear fit like this
[fitobject, gof] = fit(x, y, 'poly1');
where 'poly1' specifies you want a first order polynomial (i.e. straight line) and the output arguments give you a fit object, which you can do various things with like plot or interpolate, and the second gives you a struct containing among other things the r^2 and adjusted r^2. The fitobject also contains your fit coefficients.

How to compare different distribution means with reference truth value in Matlab?

I have production (q) values from 4 different methods stored in the 4 matrices. Each of the 4 matrices contains q values from a different method as:
Matrix_1 = 1 row x 20 column
Matrix_2 = 100 rows x 20 columns
Matrix_3 = 100 rows x 20 columns
Matrix_4 = 100 rows x 20 columns
The number of columns indicate the number of years. 1 row would contain the production values corresponding to the 20 years. Other 99 rows for matrix 2, 3 and 4 are just the different realizations (or simulation runs). So basically the other 99 rows for matrix 2,3 and 4 are repeat cases (but not with exact values because of random numbers).
Consider Matrix_1 as the reference truth (or base case ). Now I want to compare the other 3 matrices with Matrix_1 to see which one among those three matrices (each with 100 repeats) compares best, or closely imitates, with Matrix_1.
How can this be done in Matlab?
I know, manually, that we use confidence interval (CI) by plotting the mean of Matrix_1, and drawing each distribution of mean of Matrix_2, mean of Matrix_3 and mean of Matrix_4. The largest CI among matrix 2, 3 and 4 which contains the reference truth (or mean of Matrix_1) will be the answer.
mean of Matrix_1 = (1 row x 1 column)
mean of Matrix_2 = (100 rows x 1 column)
mean of Matrix_3 = (100 rows x 1 column)
mean of Matrix_4 = (100 rows x 1 column)
I hope the question is clear and relevant to SO. Otherwise please feel free to edit/suggest anything in question. Thanks!
EDIT: My three methods I talked about are a1, a2 and a3 respectively. Here's my result:
ci_a1 =
1.0e+008 *
4.084733001497999
4.097677503988565
ci_a2 =
1.0e+008 *
5.424396063219890
5.586301025525149
ci_a3 =
1.0e+008 *
2.429145282593182
2.838897116739112
p_a1 =
8.094614835195452e-130
p_a2 =
2.824626709966993e-072
p_a3 =
3.054667629953656e-012
h_a1 = 1; h_a2 = 1; h_a3 = 1
None of my CI, from the three methods, includes the mean ( = 3.454992884900722e+008) inside it. So do we still consider p-value to choose the best result?
If I understand correctly the calculation in MATLAB is pretty strait-forward.
Steps 1-2 (mean calculation):
k1_mean = mean(k1);
k2_mean = mean(k2);
k3_mean = mean(k3);
k4_mean = mean(k4);
Step 3, use HIST to plot distribution histograms:
hist([k2_mean; k3_mean; k4_mean]')
Step 4. You can do t-test comparing your vectors 2, 3 and 4 against normal distribution with mean k1_mean and unknown variance. See TTEST for details.
[h,p,ci] = ttest(k2_mean,k1_mean);
EDIT : I misinterpreted your question. See the answer of Yuk and following comments. My answer is what you need if you want to compare distributions of two vectors instead of a vector against a single value. Apparently, the latter is the case here.
Regarding your t-tests, you should keep in mind that they test against a "true" mean. Given the number of values for each matrix and the confidence intervals it's not too difficult to guess the standard deviation on your results. This is a measure of the "spread" of your results. Now the error on your mean is calculated as the standard deviation of your results divided by the number of observations. And the confidence interval is calculated by multiplying that standard error with appx. 2.
This confidence interval contains the true mean in 95% of the cases. So if the true mean is exactly at the border of that interval, the p-value is 0.05 the further away the mean, the lower the p-value. This can be interpreted as the chance that the values you have in matrix 2, 3 or 4 come from a population with a mean as in matrix 1. If you see your p-values, these chances can be said to be non-existent.
So you see that when the number of values get high, the confidence interval becomes smaller and the t-test becomes very sensitive. What this tells you, is nothing more that the three matrices differ significantly from the mean. If you have to choose one, I'd take a look at the distributions anyway. Otherwise the one with the closest mean seems a good guess. If you want to get deeper into this, you could also ask on stats.stackexchange.com
Your question and your method aren't really clear :
Is the distribution equal in all columns? This is important, as two distributions can have the same mean, but differ significantly :
is there a reason why you don't use the Central Limit Theorem? This seems to me like a very complex way of obtaining a result that can easily be found using the fact that the distribution of a mean approaches a normal distribution where sd(mean) = sd(observations)/number of observations. Saves you quite some work -if the distributions are alike! -
Now if the question is really the comparison of distributions, you should consider looking at a qqplot for a general idea, and at a 2-sample kolmogorov-smirnov test for formal testing. But please read in on this test, as you have to understand what it does in order to interprete the results correctly.
On a sidenote : if you do this test on multiple cases, make sure you understand the problem of multiple comparisons and use the appropriate correction, eg. Bonferroni or Dunn-Sidak.