Clustering matrix distance between 3 time series - cluster-analysis

I have a question about applying clustering techniques, more concretely k-means.
I have a data frame with 3 sensors (A,B,C):
time A B C
8:00:00 6 10 11
8:30:00 11 17 20
9:00:00 22 22 15
9:30:00 20 22 21
10:00:00 17 26 26
10:30:00 16 45 29
11:00:00 19 43 22
11:30:00 20 32 22
... ... ... ...
I want to group the sensors that behave similarly.
My question is: looking at the data frame above, should I calculate the correlation between the sensors and then apply the Euclidean distance to this correlation matrix, thus obtaining a 3 x 3 matrix of distances?
Or should I transpose my data frame and compute the dist() matrix with the Euclidean metric directly, which also gives a 3 x 3 matrix of distances?

You have just three sensors. That means you'll need only three values: d(A,B), d(B,C) and d(A,C). Any "clustering" here does not seem to make sense to me, and certainly not k-means. K-means is for points (!) in R^d for small d.
Choose any form of time series similarity that you like. Could be simply correlation, but also DTW and the like.
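To make the "three values" concrete, here is a minimal sketch in Python (the thread itself uses R, so the language is my choice) that computes a correlation-based distance, 1 - r, for each pair of sensors. It uses only the eight rows visible in the question; the names pearson and correlation_distance are made up for illustration, and 1 - r is just one of several reasonable similarity choices.

```python
from math import sqrt

# Sensor readings copied from the table in the question (visible rows only).
sensors = {
    "A": [6, 11, 22, 20, 17, 16, 19, 20],
    "B": [10, 17, 22, 22, 26, 45, 43, 32],
    "C": [11, 20, 15, 21, 26, 29, 22, 22],
}

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_distance(x, y):
    """1 - r: 0 for perfectly correlated series, 2 for anti-correlated ones."""
    return 1.0 - pearson(x, y)

# One distance per unordered pair of sensors: (A,B), (A,C), (B,C).
names = sorted(sensors)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
distances = {(a, b): correlation_distance(sensors[a], sensors[b]) for a, b in pairs}

for (a, b), d in distances.items():
    print(f"d({a},{b}) = {d:.3f}")
```

With only three distances you can simply read off which pair is closest; no clustering algorithm is needed.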

Q1: No. Why: The correlation is not needed here.
Q2: No. Why: I'd calculate the distances differently
For the first row, R's built-in dist() function (which uses Euclidean distance by default)
dist(c(6, 10, 11))
gives you the pairwise distances between the values:
  1 2
2 4
3 5 1
Items 2 and 3 are closest to each other. That's simple.
But there is no single way to calculate the distance between a point and a group of points. There you need a linkage function (min/max/average/...)
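As a rough illustration of what a linkage function does, here is a small Python sketch using the first-row values 6, 10, 11 from above. The function name linkage_distance and the how argument are made up for this example; real hierarchical clustering implementations offer the same three choices under the names single, complete and average linkage.

```python
def linkage_distance(point, group, how="average"):
    """Distance between one value and a group of values under a given linkage."""
    dists = [abs(point - g) for g in group]
    if how == "single":      # distance to the nearest member
        return min(dists)
    if how == "complete":    # distance to the farthest member
        return max(dists)
    if how == "average":     # mean distance to all members
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {how}")

# Items 2 and 3 (values 10 and 11) were merged first; how far is item 1 (value 6)?
group = [10, 11]
for how in ("single", "complete", "average"):
    print(how, linkage_distance(6, group, how))
```

The three linkages give different answers for the same point and group, which is why the choice has to be made explicitly.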
What I would do, using R's built-in kmeans() function:
1) Ignore the date column (assuming there are no NA values in any of the A, B, C columns).
2) Scale the data if necessary (here they all seem to have the same order of magnitude).
3) Run k-means on the A, B, C columns with k = 1...n and evaluate the results.
4) Run a final k-means with your chosen k.
5) Get the cluster assignment for each row.
6) Put the assignments in a new column to the right of C.

Related

Correlation matrix from categorical and non-categorical variables (Matlab)

In Matlab, I have a dataset in a table of the form:
SCHOOL SEX AGE ADDRESS STATUS JOB GUARDIAN HEALTH GRADE
UR F 12 U FT TEA MOTHER 1 11
GB M 22 R FT SER FATHER 5 15
GB M 12 R FT OTH FATHER 3 12
GB M 11 R PT POL FATHER 2 10
Where some variables are binary, some are categorical, some numerical. Would it be possible to extract from it a correlation matrix, with the correlation coefficients between the variables? I tried using both corrcoef and corrplot from the econometrics toolbox, but I come across errors such as 'observed data must be convertible to type double'.
Does anyone have a take on how this can be done? Thank you.
As said above, you first need to transform your categorical and binary variables to numerical values.
So if your data is in a table (T) do something like:
T.SCHOOL = categorical(T.SCHOOL);
A worked example can be found in the Matlab help here, where they use the patients dataset, which seems to be similar to your data.
You could then transform your categorical columns to double:
T.SCHOOL = double(T.SCHOOL);
Be careful with double though, as it transforms categorical variables to arbitrary numbers, see the matlab forum.
Also note, that you are introducing order into your categorical variables, if you simply transform them to numbers. So if you for example transform JOB 'TEA', 'SER', 'OTH' to 1, 2, 3, etc. you are making the variable ordinal. 'TEA' is then < 'OTH'.
If you want to avoid that you can re-code the categorical columns into 'binary' dummy variables:
dummy_schools = dummyvar(T.SCHOOL);
This returns a matrix with one row per observation and one column per unique category in T.SCHOOL.
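To show what the dummy coding produces, here is a minimal Python sketch of the same idea applied to the SCHOOL column from the question. The helper dummy_columns is made up for illustration and is not a MATLAB or library function; it orders the categories alphabetically, which is an assumption of this sketch.

```python
def dummy_columns(values):
    """Return the sorted categories and a 0/1 indicator row for each value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# SCHOOL column from the four rows shown in the question.
school = ["UR", "GB", "GB", "GB"]
categories, dummies = dummy_columns(school)

print(categories)
for row in dummies:
    print(row)
```

Each row has exactly one 1, so no artificial ordering between categories is introduced.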
And then there is the whole discussion, whether it is useful to calculate correlations of categorical variables. Like here.
I hope this helps :)
I think you need to make all the data numeric, i.e. recode the non-numerical columns, for example:
SCHOOL SEX AGE ADDRESS STATUS JOB GUARDIAN HEALTH GRADE
1 1 12 1 1 1 1 1 11
2 2 22 2 1 2 2 5 15
2 2 12 2 1 3 2 3 12
2 2 11 2 2 4 2 2 10
and then do the correlation.

How to compute & plot Equal Error Rate (EER) from FAR/FRR values using matlab

I have the following FAR/FRR values. I want to compute the EER and then plot it in Matlab.
FAR FRR
19.64 20
21.29 18.61
24.92 17.08
19.14 20.28
17.99 21.39
16.83 23.47
15.35 26.39
13.20 29.17
7.92 42.92
3.96 60.56
1.82 84.31
1.65 98.33
26.07 16.39
29.04 13.13
34.49 9.31
40.76 6.81
50.33 5.42
66.83 1.67
82.51 0.28
Is there a Matlab function available to do this? Can somebody explain this to me? Thanks.
Let me try to answer your question
1) For your data, the EER can be the mean/max/min of [19.64, 20].
1.1) The idea of the EER is to measure a system's performance against another system (the lower the better) by finding the point where the False Alarm Rate (FAR) and the False Reject Rate (FRR, or miss rate) are equal (or, if they are never exactly equal, where the distance between them is smallest).
In your data, [19.64, 20] gives the minimum distance, so it can be used as the EER. You can take the mean, max or min of these two values; however, since the EER is meant for comparing systems, make sure the other system uses the same method (mean/max/min) to pick its EER value.
The difference between mean/max/min can be ignored if there is a large amount of data. In some speaker verification tasks there are 100k data samples.
2) To understand the EER, it is better to compute it yourself. Here is how:
You need to know two things:
A) The system score for each test case (trial)
B) The true/false label for each trial
Once you have A and B, create [trial, score, true/false] tuples and sort them by score. Then loop through the scores, e.g. from min to max. At each step, take that score as the threshold and compute the FAR and FRR. After looping through all the scores, find the FAR and FRR with "equal" values.
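The threshold sweep described above can be sketched in a few lines of Python. This is not the code from pyeer.py; the function name equal_error_rate, the toy scores and the convention "accept when score >= threshold" (i.e. a similarity score) are all assumptions made for this example, and the EER is taken as the mean of FAR and FRR at the closest point.

```python
def equal_error_rate(scores, is_target):
    """Sweep every observed score as a threshold; return (EER, threshold)."""
    n_target = sum(is_target)
    n_nontarget = len(is_target) - n_target
    best = None
    for threshold in sorted(set(scores)):
        # A trial is accepted when its score reaches the threshold.
        false_accepts = sum(1 for s, t in zip(scores, is_target) if s >= threshold and not t)
        false_rejects = sum(1 for s, t in zip(scores, is_target) if s < threshold and t)
        far = false_accepts / n_nontarget
        frr = false_rejects / n_target
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, threshold, far, frr)
    _, threshold, far, frr = best
    return (far + frr) / 2, threshold

# Hypothetical trials: four true (target) and four false (non-target) ones.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
is_target = [True, True, True, True, False, False, False, False]

eer, threshold = equal_error_rate(scores, is_target)
print(f"EER = {eer:.2f} at threshold {threshold}")
```

For a distance score (lower means more similar) the two comparisons would be flipped.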
For the code you can refer to my pyeer.py , in function processDataTable2
https://github.com/StevenLOL/Research_speech_speaker_verification_nist_sre2010/blob/master/SRE2010/sid/pyeer.py
This function is written for the NIST SRE 2010 evaluation.
4) There are other measures similar to the EER, such as minDCF, which only changes the weights of FAR and FRR. You can refer to "Performance Measure" of http://www.nist.gov/itl/iad/mig/sre10results.cfm
5) You can also refer to this package https://sites.google.com/site/bosaristoolkit/ and DETware_v2.1.tar.gz at http://www.itl.nist.gov/iad/mig/tools/ for computing and plotting EER in Matlab
Plotting in DETWare_v2.1
Pmiss=1:50;Pfa=50:-1:1;
Plot_DET(Pmiss/100.0,Pfa/100.0,'r')
FAR(t) and FRR(t) are parameterized by threshold, t. They are cumulative distributions, so they should be monotonic in t. Your data is not shown to be monotonic, so if it is indeed FAR and FRR, then the measurements were not made in order. But for the sake of clarity, we can order:
FAR FRR
1 1.65 98.33
2 1.82 84.31
3 3.96 60.56
4 7.92 42.92
5 13.2 29.17
6 15.35 26.39
7 16.83 23.47
8 17.99 21.39
9 19.14 20.28
10 19.64 20
11 21.29 18.61
12 24.92 17.08
13 26.07 16.39
14 29.04 13.13
15 34.49 9.31
16 40.76 6.81
17 50.33 5.42
18 66.83 1.67
19 82.51 0.28
This is for increasing FAR, which assumes a distance score; if you have a similarity score, then FAR would be sorted in decreasing order.
Loop over FAR until it is larger than FRR, which occurs at row 11. Then interpolate the cross over value between rows 10 and 11. This is your equal error rate.
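The crossover step can be sketched as follows in Python (the question asks for Matlab, so treat this as pseudocode you can port). It uses the sorted table above, finds the first row where FAR exceeds FRR, and linearly interpolates between that row and the previous one. The function name eer_by_interpolation is made up, and linear interpolation is an assumption; other interpolation schemes are possible.

```python
# Sorted FAR/FRR values (in percent) from the table above.
far = [1.65, 1.82, 3.96, 7.92, 13.2, 15.35, 16.83, 17.99, 19.14, 19.64,
       21.29, 24.92, 26.07, 29.04, 34.49, 40.76, 50.33, 66.83, 82.51]
frr = [98.33, 84.31, 60.56, 42.92, 29.17, 26.39, 23.47, 21.39, 20.28, 20.0,
       18.61, 17.08, 16.39, 13.13, 9.31, 6.81, 5.42, 1.67, 0.28]

def eer_by_interpolation(far, frr):
    """Return (EER, 1-based row where FAR first exceeds FRR)."""
    for i in range(1, len(far)):
        d_prev = far[i - 1] - frr[i - 1]
        d_curr = far[i] - frr[i]
        if d_prev <= 0 < d_curr:
            # Fraction of the way from row i-1 to row i where FAR - FRR hits zero.
            t = -d_prev / (d_curr - d_prev)
            return far[i - 1] + t * (far[i] - far[i - 1]), i + 1
    raise ValueError("FAR never crosses FRR in the given data")

eer, row = eer_by_interpolation(far, frr)
print(f"FAR exceeds FRR at row {row}; interpolated EER is about {eer:.2f}%")
```

To plot it, draw FAR and FRR against the row index (or the threshold, if you have it) and mark the crossing point.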

Finding matching rows from original dataset in an slightly reduced version of itself

I have two datasets. The original has all the labels and descriptions of each variable. The second is a reduced version of the first, used for specific experiments, but it has none of the information about the variables that the original contains. So I'm trying to match the two datasets.
My question is: how can I find out whether a row from the original dataset is present in the new dataset, given that a slight reduction has been performed in both matrix dimensions?
To be more specific, the original dataset is a 24481 x 117 matrix and the new one is a 24188 x 97 matrix. The problem is that I have no information about which rows or columns were or were not included in the new dataset.
What you can do is zero-pad the matrix with fewer elements so that it matches the size of the original data. Then use
find(A==B)
where A and B are the matrices.
Using the intersect function worked for me. Since the data was reduced in both dimensions, I first look for the intersection of the first column vectors of the two matrices (assuming that at least the column order was preserved in the reduction).
>> M = magic(5)
M =
17 24 1 8 15
23 5 7 14 16
4 6 13 20 22
10 12 19 21 3
11 18 25 2 9
>> X = M([2,3,5], [1,2,4,5])
X =
23 5 14 16
4 6 20 22
11 18 2 9
>> [c,xi, mi]=intersect(X(:,1),M(:,1))
mi is the index vector of the rows of the original matrix M that are present in the reduced matrix X.
Doing the same for the first rows of the two matrices gave me an index vector of the columns selected from the original matrix M.
>> [c,xi, mi]=intersect(X(1,:),M(1,:))
A drawback of this solution: if the first row or column of the original matrix was not selected for the new set, the intersection is empty and you have to move on to the next row or column of the original matrix until you get a match (luckily not too far ;)). In the example above, row 1 of M was not selected, so compare against row 2 instead:
>> [c,xi, mi]=intersect(X(1,:),M(2,:))
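Here is the same idea as a minimal Python sketch on the magic(5) example, with 0-based indices instead of Matlab's 1-based ones. The helper match_indices is made up for illustration. It assumes the values in the compared vectors are unique and that the first column of M survived the reduction; for the column step it reuses the first matched row of M, which avoids guessing which row to compare against.

```python
# M = magic(5) as printed above; X keeps rows 2,3,5 and columns 1,2,4,5 (1-based).
M = [
    [17, 24, 1, 8, 15],
    [23, 5, 7, 14, 16],
    [4, 6, 13, 20, 22],
    [10, 12, 19, 21, 3],
    [11, 18, 25, 2, 9],
]
X = [
    [23, 5, 14, 16],
    [4, 6, 20, 22],
    [11, 18, 2, 9],
]

def match_indices(reduced_vec, original_vec):
    """0-based positions in original_vec of the values found in reduced_vec."""
    position = {value: i for i, value in enumerate(original_vec)}
    return [position[v] for v in reduced_vec if v in position]

# Rows: compare the first column of X with the first column of M.
row_idx = match_indices([r[0] for r in X], [r[0] for r in M])

# Columns: compare the first row of X with the row of M it was matched to.
col_idx = match_indices(X[0], M[row_idx[0]])

# Sanity check: X should be exactly M restricted to those rows and columns.
rebuilt = [[M[i][j] for j in col_idx] for i in row_idx]
print(row_idx, col_idx, rebuilt == X)
```

On real data with repeated values, a single column is not enough to identify a row, so the final check against the rebuilt matrix is worth keeping.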

calculating x2 from poisson distributed data

So I have a table of values
v=0 1 2 3 4 5 6 7 8 9
#times obs.: 5 19 23 21 14 12 3 2 1 0
I am supposed to calculate chi squared assuming the data fits a poisson dist. with mean u=3.
I have to group values >=6 all in one bin.
I am unsure of how to plot the poisson dist., and most of all how to control what goes into what bin, if that makes sense.
I have plotted a histogram using histc before..but it was with random numbers that I normalized. The amount in each bin was set for me.
I am super new...sorry if this question sucks.
You use bar to plot a bar graph in matlab.
So this is what you do:
v=0:9;
f=[5 19 23 21 14 12 3 2 1 0];
fc=f(v<6); % copy elements where v<6 into a new array
fc(end+1)=sum(f(v>=6)); % append the sum of elements where v>=6 to that array
figure
bar(v(v<=6), fc);
That should do the trick...
Now you didn't actually ask about the chi squared calculation. I would urge you not to put values of v>6 all into one bin for that calculation, as it will give you a really bad result.
There is another technique: if you use the hist function, you can choose the bins - and Matlab will automatically put things that exceed the limits into the last bin. So if your observations were in the array Obs, you can do what was asked with:
h = hist(Obs, 0:6);
figure
bar(0:6, h)
The advantage is that you have the array h available (frequencies) for other calculations.
If you do instead
hist(Obs, 0:6)
Matlab will plot the graph for you in a single statement (but you don't have the values...)
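For the chi-squared part of the question, here is a minimal sketch in Python (easy to port to Matlab) that follows the binning the question asks for: values 0 to 5 in their own bins and everything >= 6 pooled into one. The expected counts come from a Poisson distribution with mean 3, and the last bin gets the remaining tail probability. The variable names are mine, not from the question.

```python
from math import exp, factorial

observed = [5, 19, 23, 21, 14, 12, 3, 2, 1, 0]   # counts for v = 0..9
mean = 3.0
n_total = sum(observed)

def poisson_pmf(k, mu):
    """Probability of observing k events for a Poisson distribution with mean mu."""
    return exp(-mu) * mu ** k / factorial(k)

# Bins: v = 0, 1, 2, 3, 4, 5 and one pooled bin for v >= 6.
obs_binned = observed[:6] + [sum(observed[6:])]

# Expected probabilities: exact pmf for 0..5, remaining tail mass for >= 6.
p_binned = [poisson_pmf(k, mean) for k in range(6)]
p_binned.append(1.0 - sum(p_binned))
expected = [n_total * p for p in p_binned]

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.
chi2 = sum((o - e) ** 2 / e for o, e in zip(obs_binned, expected))

# The mean is given rather than fitted, so degrees of freedom = bins - 1.
dof = len(obs_binned) - 1
print(f"chi2 = {chi2:.3f} with {dof} degrees of freedom")
```

The expected values can be drawn next to the observed bars to compare the data with the Poisson shape.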

Matlab NN inputs & output manipulation

Assume I have this matrix, A :
A=[ 25 11 2010 10 23 75
30 11 2010 11 24 45
31 12 2010 19 24 44
31 12 2010 22 27 32
1 1 2011 14 27 27
2 12 2011 15 28 30
3 12 2011 16 24 42 ];
The first 5 columns are the inputs (some measured parameters) and the last column is the corresponding output. The number of rows is the number of measurements taken.
I want to use the Matlab neural network GRNN with the function newgrnn (or any other NN function) to train on the data up to the 5th row, and then feed in the inputs of the remaining 2 rows to evaluate their corresponding outputs. I have tried many times to do this, but it always gives me an error and the program does not run correctly. I have looked at the example in the newgrnn help, but it only has one input, while this example has 5 inputs.
My question is: how do we pass the inputs and the output to the newgrnn function? My actual matrix is very large, with 22 inputs and one output (26352 by 23); the above is only a small example.
Since you haven't given any examples of what you've tried and what errors you get from your attempts, I'll have to give you a fairly generic answer.
Have a look at the newgrnn help file.
net = newgrnn(P,T,spread) takes three inputs,
P R-by-Q matrix of Q input vectors
T S-by-Q matrix of Q target class vectors
spread Spread of radial basis functions (default = 1.0)
So if your matrix A always has the outputs (target class vectors) in the last column, then the training outputs are A(1:5,end) and the training inputs are A(1:5,1:end-1). These say "first 5 rows of A, last column" and "first 5 rows of A, all but the last column" respectively. Note that newgrnn expects one input vector per column (P is R-by-Q), while A stores one measurement per row, so both need to be transposed.
Then (simply following the example in the newgrnn help file; you will have to tweak it for your own A):
net = newgrnn(A(1:5,1:end-1)', A(1:5,end)');
% predict new values
Y = sim(net, A(6:7,1:end-1)')
I think you should also read the Matlab help file for indexing arrays and matrices.
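If it helps to see the train/test split and what a GRNN does with it, here is a rough Python sketch. It is not newgrnn: it implements the general GRNN idea (a Gaussian-weighted average of the training targets), and the function name grnn_predict and the parameter sigma are assumptions of this sketch that do not map exactly onto Matlab's spread.

```python
from math import exp

# Matrix A from the question: 5 input columns, last column is the output.
A = [
    [25, 11, 2010, 10, 23, 75],
    [30, 11, 2010, 11, 24, 45],
    [31, 12, 2010, 19, 24, 44],
    [31, 12, 2010, 22, 27, 32],
    [1, 1, 2011, 14, 27, 27],
    [2, 12, 2011, 15, 28, 30],
    [3, 12, 2011, 16, 24, 42],
]

# First 5 rows for training, last 2 rows for testing.
train, test = A[:5], A[5:]
X_train = [row[:-1] for row in train]
y_train = [row[-1] for row in train]
X_test = [row[:-1] for row in test]

def grnn_predict(x, X_train, y_train, sigma=1.0):
    """Gaussian-kernel weighted average of the training targets."""
    weights = []
    for xi in X_train:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
        weights.append(exp(-sq_dist / (2 * sigma ** 2)))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights underflowed; increase sigma or scale the inputs")
    return sum(w * y for w, y in zip(weights, y_train)) / total

predictions = [grnn_predict(x, X_train, y_train) for x in X_test]
print(predictions)
```

With unscaled inputs and a small sigma, each prediction is dominated by the single nearest training row, so scaling the columns or increasing the spread is usually worth trying.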