Calculating standard deviation using data points and count vector in MATLAB

I want to use MATLAB to calculate the standard deviation of a population I have compiled.
The MATLAB std function takes one large population vector as input and outputs the standard deviation.
However, for optimisation purposes, instead of one large vector I have a set of individual data points and the number of times each point occurs.
I could use a loop to create a huge population vector, but that's not ideal.
How could I go about this?

Very easy from the definition of standard deviation: just introduce weights to account for the number of repetitions of each data point:
data = [1 3 4 2 4]; %// data values
count = [4 5 4 5 8]; %// number of times for each value
mu = sum(data.*count)./(sum(count));
dev = sqrt(sum((data-mu).^2.*count)./(sum(count)-1)); %// sample std; use ./sum(count) for population std
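As a sanity check, the weighted result matches the naive approach of expanding the full population vector (a quick sketch, assuming MATLAB R2015a or later for repelem):
full_vec = repelem(data, count); %// expand each value count(i) times
mu_check = mean(full_vec); %// equals mu
dev_check = std(full_vec); %// equals dev (sample standard deviation)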

Related

Distance Calculations for Nearest Mean Classifier

Greetings,
How can I calculate how many distance calculations would need to be performed to classify the IRIS dataset using a Nearest Mean Classifier?
I know that the IRIS dataset has 4 features and every record is classified according to 3 different labels.
According to some textbooks, the calculation can be carried out as follows:
However, I am lost on these different notations and what this equation means. For example, what is s^2 in the equation?
The notation is standard in most machine learning textbooks. s in this case is the sample standard deviation of the training set. It is quite common to assume that each class has the same standard deviation, which is why every class is assigned the same value.
However, you shouldn't be paying attention to that. The most important point is the assumption that the priors are equal. This is a fair assumption, meaning that you expect the distribution of each class in your dataset to be roughly equal. Under this assumption, the classifier simply boils down to finding the smallest distance from a training sample x to each of the classes, represented by their mean vectors.
How you'd compute this is quite simple. In your training set, you have a set of training examples, each belonging to a particular class. For the iris dataset, you have three classes. You find the mean feature vector for each class, stored as m1, m2 and m3 respectively. Afterwards, to classify a new feature vector, simply find the distance from this vector to each of the mean vectors; whichever mean has the smallest distance determines the class you assign.
Since you chose MATLAB as the language, allow me to demonstrate with the actual iris dataset.
load fisheriris; % Load iris dataset
[~,~,id] = unique(species); % Map each example's class label to a numeric ID (1, 2 or 3)
means = zeros(3, 4); % Store the mean vectors for each class
for i = 1 : 3 % Find the mean vectors per class
    means(i,:) = mean(meas(id == i, :), 1); % Find the mean vector for class i
end
x = meas(10, :); % Choose a row from the dataset (here, row 10)
% Determine which class has the smallest distance and thus figure out the class
[~,c] = min(sum(bsxfun(@minus, x, means).^2, 2));
The code is fairly straightforward. Load in the dataset, and since the labels are in a cell array, it's handy to create a new set of labels enumerated as 1, 2 and 3 so that it's easy to isolate the training examples per class and compute their mean vectors. That's what's happening in the for loop. Once that's done, I choose a data point from the training set (row 10), then compute the distance from this point to each of the mean vectors and choose the class that gives the smallest distance.
If you wanted to do this for the entire dataset, you can but that will require some permutation of the dimensions to do so.
data = permute(meas, [1 3 2]); % 150 x 1 x 4
means_p = permute(means, [3 1 2]); % 1 x 3 x 4
P = sum(bsxfun(@minus, data, means_p).^2, 3); % 150 x 3 matrix of squared distances
[~,c] = min(P, [], 2);
data and means_p are the features and mean vectors reshaped into 3D arrays with singleton dimensions so that they broadcast against each other. The third line of code computes all the distances in a vectorized way, generating a 2D matrix in which row i contains the squared distance from training example i to each of the mean vectors. We finally find the class with the smallest distance for each example.
To get a sense of the accuracy, we can simply compute the fraction of the total number of times we classified correctly:
>> sum(c == id) / numel(id)
ans =
0.9267
With this simple nearest mean classifier, we have an accuracy of 92.67%... not bad, but you can do better. Finally, to answer your question, you would need K * d distance calculations, with K being the number of examples and d being the number of classes; for the iris dataset that is 150 * 3 = 450. You can clearly see that this is required by examining the logic and code above.
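As a small sketch of that count in code, using the variables defined above:
K = numel(id); % number of examples (150 for fisheriris)
d = numel(unique(id)); % number of classes (3)
numDistances = K * d % 450 distance calculations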

Finding proportional columns in matrix

I have a big matrix (1,000 rows and 50,000 columns). I know some columns are correlated (the rank is only 100) and I suspect some columns are even proportional. How can I find such proportional columns? One way would be looping over corr(M(:,j), M(:,k)), but is there anything more efficient?
If I am understanding your problem correctly, you wish to determine those columns in your matrix that are linearly dependent, meaning that one column is proportional to, or a scalar multiple of, another. There's a very basic algorithm based on QR decomposition. QR decomposition takes any matrix and decomposes it into a product of two matrices, Q and R. In other words:
A = Q*R
Q is an orthogonal matrix whose columns are unit vectors, so that multiplying Q by its transpose gives the identity matrix (Q^T * Q = I). R is a right-triangular or upper-triangular matrix. A very useful result from Golub and Van Loan's 1996 book, Matrix Computations, is that a matrix is full rank if and only if all of the diagonal elements of R are non-zero. Because of floating-point precision on computers, we have to compare against a tolerance instead: take the absolute values of the diagonal of R, and any column whose corresponding diagonal entry exceeds the tolerance is an independent column.
We can slightly modify this so that we search for values that are less than the tolerance, which would mean that the column is not independent. The way you would call the QR factorization is:
[Q,R] = qr(A, 0);
Q and R are what I just talked about, and you specify the matrix A as input. The second argument 0 requests an economy-size version of Q and R. For a tall rectangular matrix (more rows than columns), say 8 x 5, the economy-size call returns Q as 8 x 5 and R as 5 x 5, whereas the full factorization returns Q as 8 x 8 and R as 8 x 5. (With three outputs, the 0 also makes the permutation output a vector rather than a matrix.)
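A quick sketch of the size difference, using a random tall matrix:
B = randn(8, 5);
[Q1, R1] = qr(B); % Q1 is 8 x 8, R1 is 8 x 5 (full)
[Q2, R2] = qr(B, 0); % Q2 is 8 x 5, R2 is 5 x 5 (economy)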
Now, what we actually need is this style of invocation:
[Q,R,E] = qr(A, 0);
In this case, E would be a permutation vector, such that:
A(:,E) = Q*R;
The reason why this is useful is because it orders the columns of Q and R such that the first column of the re-ordered version is the most likely to be independent, followed by the remaining columns in decreasing order of "strength". As such, E tells you how likely each column is to be linearly independent, with those "strengths" in decreasing order. This "strength" is exactly captured in the diagonals of R corresponding to the re-ordering: their magnitudes are non-increasing, with the first entry the largest. What you should do is check which diagonals of R in the re-arranged version are greater than this first entry scaled by the tolerance, and use these to determine which of the corresponding columns are linearly independent.
However, I'm going to flip this around and find the position in the diagonal of R where the last possible independent column is located. Anything after this point is considered linearly dependent. This is essentially the same as checking whether any diagonals are less than the threshold, but we are using the re-ordering of the matrix to our advantage.
In any case, putting what I have mentioned in code, this is what you should do, assuming your matrix is stored in A:
%// Step #1 - Define tolerance
tol = 1e-10;
%// Step #2 - Do QR factorization
[Q, R, E] = qr(A, 0);
diag_R = abs(diag(R)); %// Extract diagonals of R
%// Step #3 - Find the LAST column in the re-arranged result that
%// satisfies the linearly independent property
r = find(diag_R >= tol*diag_R(1), 1, 'last');
%// Step #4 - Anything after r means that the columns are
%// linearly dependent, so let's output those columns to the user
idx = sort(E(r+1:end));
Note that E will be a permutation vector, and I'm assuming you want the output sorted; that's why we sort the indices that come after the point where the columns stop being linearly independent. Let's test out this theory. Suppose I have this matrix:
A =
1 1 2 0
2 2 4 9
3 3 6 7
4 4 8 3
You can see that the first two columns are the same, and the third column is a multiple of the first or second. You would just have to multiply either one by 2 to get the result. If we run through the above code, this is what I get:
idx =
1 2
If you also take a look at E, this is what I get:
E =
4 3 2 1
This means that column 4 was the "best" linearly independent column, followed by column 3. Because we returned [1,2] as the linearly dependent columns, columns 1 and 2, which both equal [1;2;3;4], are scalar multiples of some other column. In this case that is column 3, as columns 1 and 2 are each half of column 3.
Hope this helps!
Alternative Method
If you don't want to do any QR factorization, then I can suggest reducing your matrix to its reduced row echelon form, from which you can determine the basis vectors that make up the column space of your matrix A. Essentially, the column space gives you the minimum set of columns that can generate all possible linear combinations of output vectors if you were to apply this matrix using matrix-vector multiplication. You can determine which columns form the column space by using the rref command: provide a second output to rref, which gives you a vector of indices telling you which columns are linearly independent, i.e. form a basis of the column space of the matrix. As such:
[B,RB] = rref(A);
RB gives you the indices of the columns that form the column space, and B is the reduced row echelon form of the matrix A. Because you want to find those columns that are linearly dependent, you want to return the indices that are not among these locations. As such, define a linearly increasing vector from 1 to as many columns as you have, then use RB to remove those entries from this vector; the result is the set of linearly dependent columns you are seeking. In other words:
[B,RB] = rref(A);
idx = 1 : size(A,2);
idx(RB) = [];
By using the above code, this is what we get:
idx =
2 3
Bear in mind that we identified columns 2 and 3 as linearly dependent, which makes sense as both are multiples of column 1. The identification of which columns are linearly dependent differs from the QR factorization method, as QR orders the columns based on how likely each particular column is to be linearly independent. Because columns 1 to 3 are related to each other, it shouldn't matter which of them is returned; any one of them forms a basis for the other two.
I haven't tested the efficiency of using rref in comparison to the QR method. I suspect that rref performs Gaussian elimination, whose complexity is worse than that of QR factorization, as the QR algorithm is highly optimized and efficient. Because your matrix is rather large, I would stick to the QR method, but try rref anyway and see what you get!
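Putting both approaches together on the example matrix above, here is a quick sketch you can paste into MATLAB to reproduce the results shown:
A = [1 1 2 0; 2 2 4 9; 3 3 6 7; 4 4 8 3]; %// example matrix from above
[~, R, E] = qr(A, 0); %// pivoted QR
r = find(abs(diag(R)) >= 1e-10*abs(R(1,1)), 1, 'last');
depQR = sort(E(r+1:end)) %// [1 2] for this example
[~, RB] = rref(A); %// basis columns of the column space
depRREF = setdiff(1:size(A,2), RB) %// [2 3] for this example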
If you normalize each column by dividing by its maximum, proportionality becomes equality. This makes the problem easier.
Now, to test for equality you can use a single (outer) loop over columns; the inner loop is easily vectorized with bsxfun. For greater speed, compare each column only with the columns to its right.
Also, to save some time, the result matrix is preallocated to an approximate size (you should set that). If the approximate size is wrong, the only penalty is slightly slower speed, but the code still works.
As usual, tests for equality between floating-point values should include a tolerance.
The result is given as a 2-column matrix (S), where each row contains the indices of two columns that are proportional.
A = [1 5 2 6 3 1
     2 5 4 7 6 1
     3 5 6 8 9 1]; %// example data matrix
tol = 1e-6; %// relative tolerance
A = bsxfun(@rdivide, A, max(A,[],1)); %// normalize A
C = size(A,2);
S = NaN(round(C^1.5),2); %// preallocate result to *approximate* size
used = 0; %// number of rows of S already used
for c = 1:C
    ind = c+find(all(abs(bsxfun(@rdivide, A(:,c), A(:,c+1:end))-1)<tol));
    u = numel(ind); %// number of columns proportional to column c
    S(used+1:used+u,1) = c; %// fill in result
    S(used+1:used+u,2) = ind; %// fill in result
    used = used + u; %// update number of results
end
S = S(1:used,:); %// remove unused rows of S
In this example, the result is
S =
1 3
1 5
2 6
3 5
meaning column 1 is proportional to column 3; column 1 is proportional to column 5 etc.
If the determinant of a 2 x 2 matrix is zero, then its two columns are proportional.
There are 50,000 columns, and it is easiest to evaluate the determinant of a 2 by 2 matrix, so you can test the columns pairwise with 2 x 2 determinants formed from their entries.
Hence, to find the proportional columns, the low-tech solution is to define the big matrix on a spreadsheet, apply the determinant formula to a 2 x 2 square beginning from the first square on the left, and copy it across every row and column to arrive at the answer in the next sheet.
Then find the columns whose determinants are all zero.
This is quite basic, not very time-consuming, and should be result-oriented.
Manual or an Excel spreadsheet (efficient).
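Translated into MATLAB, the same 2 x 2 determinant test for one pair of columns might look like this sketch (u and v stand for any two columns of the matrix M from the question; the tolerance is an assumption):
u = M(:,1); v = M(:,2); %// a hypothetical pair of columns of M
D = u*v.' - v*u.'; %// all 2 x 2 determinants u(i)*v(j) - u(j)*v(i)
isProportional = all(abs(D(:)) < 1e-10); %// true iff u and v are proportional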

Find the closest weight vector to each instance in the data matrix

Suppose I have a weight matrix W of size n x m, where m is the number of variables and n is the number of instances. Also I have a data matrix X of the same size. I am trying to find the closest weight vector to each instance in X. However, both matrices are high-dimensional, so plain methods are not sufficient. I have tried a GPU trick in MATLAB, but it does not work well since it is a sequential approach that calculates the closest weight for each instance one at a time. I am now looking for efficient one-shot code that takes all of W and X and finds the winner, using MATLAB tricks and possibly the GPU. Can anyone suggest a code snippet in MATLAB?
This is what I wrote for the sequential version:
x_in_d = gpuArray(x_in); % move the input instance to the device
W_d = gpuArray(W); % move the weight matrix to the device
Dx = W_d - x_in_d(ones(size(W_d,1),1), :); % replicate x_in across rows and subtract
[d_min,winner] = min(sum((Dx.^2)')); % squared distance to each weight row; keep the smallest
d_min = gather(d_min); % gather results
winner = gather(winner);
What do you mean by high-dimensional? It's just an n x m matrix, right?
It would be really helpful if you could provide some sample data. Based on your description (which isn't the clearest), here is what I think your data looks like:
weights=
[1 4 2
5 3 1]
data=
[2 5 1
1 2 2]
And you want to figure out which row of weights is closest to each row of data? In this case that would be the first row of weights for both rows of data.
Please edit your question to clarify what you're asking for, and consider using some examples.
EDIT:
I like Rody's duplicate comment; if I am correct, check out: Link Here
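For reference, a one-shot vectorized version in the spirit of the permute/bsxfun trick from the nearest-mean answer above (a sketch, assuming rows of W are weight vectors and rows of X are instances; wrap the arrays in gpuArray for the GPU, or use pdist2 from the Statistics Toolbox to the same effect):
Xp = permute(X, [1 3 2]); % n x 1 x m
Wp = permute(W, [3 1 2]); % 1 x n x m
D = sum(bsxfun(@minus, Xp, Wp).^2, 3); % n x n matrix of squared distances
[d_min, winner] = min(D, [], 2); % closest weight vector for each instance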

How to find the row rank of matrix in Galois fields?

Matlab has a built-in function for calculating the rank of a matrix with decimal numbers as well as finite field numbers. However, if I am not mistaken, it calculates only the lowest rank (the least of the row rank and column rank). I would like to calculate only the row rank, i.e. find the number of independent rows of a matrix (over a finite field in my case). Is there a function or way to do this?
In linear algebra the column rank and the row rank are always equal (see proof), so just use rank
(if you're computing the rank of a matrix over Galois fields, consider using gfrank instead, as @DanBecker suggested in his comment):
Example:
>> A = [1 2 3; 4 5 6]
A =
1 2 3
4 5 6
>> rank(A)
ans =
2
All three columns may seem linearly independent at first glance, but they are in fact dependent:
[1 2; 4 5] \ [3; 6]
ans =
-1
2
meaning that -1 * [1; 4] + 2 * [2; 5] = [3; 6]
Schwartz,
Two comments:
You state in a comment "The rank function works just fine in Galois fields as well!" I don't think this is correct. Consider the example given on the documentation page for gfrank:
A = [1 0 1;
     2 1 0;
     0 1 1];
gfrank(A,3) % gives answer 2
rank(A) % gives answer 3
But it is possible I am misunderstanding things!
You also said "How to check if the rows of a matrix are linearly independent? Does the solution I posted above seem legit to you i.e. taking each row and finding its rank with all the other rows one by one?"
I don't know why you say "find its rank with all the other rows one by one". It is possible to have a set of vectors which are pairwise linearly independent, yet linearly dependent taken as a group. Just consider the vectors [0 1], [1 0], [1 1]. No vector is a multiple of any other, yet the set is not linearly independent.
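A quick check of that counterexample (a sketch over the reals):
V = [0 1; 1 0; 1 1]; % three pairwise-independent row vectors
rank(V) % returns 2, less than the 3 rows, so the set is linearly dependent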
Your problem appears to be that you have a set of vectors that you know are linearly independent. You add a vector to that set and want to know whether the new set is still linearly independent. As @EitanT said, all you need to do is combine the (row) vectors into a matrix and check whether its rank (or gfrank) is equal to the number of rows. No need to do anything "one-by-one".
Since you know that the "old" set is linearly independent, perhaps there is a nice fast algorithm to check whether the new vector makes things linearly dependent. Maybe at each step you orthogonalize the set, and perhaps that would make the process of checking for linear independence given the new vector faster. That might make an interesting question somewhere like MathOverflow.
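Over the reals, that incremental idea might look like the following sketch (a hypothetical helper, not a library function; a Galois field version would need field arithmetic instead of floating point):
function [Q, independent] = addVector(Q, v, tol)
% Q holds an orthonormal basis of the current set, one row per vector.
% v is a candidate row vector; independent is true if v extends the basis.
r = v - (v*Q.')*Q; % residual of v after projecting onto the rows of Q
independent = norm(r) > tol;
if independent
    Q = [Q; r/norm(r)]; % append the normalized residual to the basis
end
end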

Creating a new probabilistic matrix from two existing ones according to prespecified rules in MATLAB

I have a problem in my MATLAB code. Let me first give you some explanation about the issue. I have two matrices which represent probabilities of specific outcomes of events. The first one is called DemandProbabilityMatrix, or DemandP for short. Entry (i,j) shows the probability that item i is demanded j times. Similarly, we have a ReturnProbabilityMatrix, i.e. ReturnP. An element of type (i,j) stores the probability that item i is returned j times.
We want to compute the net demand probability out of these two matrices. For an example:
DemandP=[ .4 .5 .1]
ReturnP=[ .2 .3 .5]
In this case we have 1 item, and it can be demanded or returned either 1, 2 or 3 times with the given probabilities. To be more specific, the item is demanded exactly once with probability .4.
Then we need to compute the net demand, i.e. demand minus returns. In this case the net demand can be -2, -1, 0, 1 or 2. For instance, to get a net demand of -1 we can either have a demand of 1 and a return of 2, or a demand of 2 and a return of 3. Thus, with column j of NetDemandP corresponding to a net demand of j-3, we have
NetDemandP(1,2) = DemandP(1,1)*ReturnP(1,2) + DemandP(1,2)*ReturnP(1,3)
So NetDemandP should look like:
NetDemandP=[.20 .37 .28 .13 .02]
I can do this with nested for loops, but I'm trying to come up with a faster way. In case it helps, here is my for-loop solution, where I denotes the number of rows of ReturnP and DemandP, and J+1 denotes the number of columns of those matrices:
NetDemandP = zeros(I, 2*J+1);
for i = 1:I
    for j = 1:J+1
        for k = 1:J+1
            NetDemandP(i,j-k+J+1) = NetDemandP(i,j-k+J+1) + DemandP(i,j)*ReturnP(i,k);
        end
    end
end
Thanks in advance
What you want is the convolution of your probability density functions. Or, more specifically, you want the convolution of the demand density with the reverse of the return density. This is easily achieved in Matlab. For example:
DemandP = [.4 .5 .1];
ReturnP = [.2 .3 .5];
NetDemandP = conv(DemandP,fliplr(ReturnP))
If you have matrices instead of vectors, then just iterate over the rows:
for i = 1:size(DemandP,1)
    NetDemandP(i,:) = conv(DemandP(i,:), fliplr(ReturnP(i,:)));
end
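A small refinement of that loop (a sketch): preallocate the output so the result matrix is not grown on every iteration. The convolution of two length-(J+1) rows has length 2*J+1, matching the loop version in the question.
[I, n] = size(DemandP); % n = J+1 columns
NetDemandP = zeros(I, 2*n-1); % 2*n-1 = 2*J+1
for i = 1:I
    NetDemandP(i,:) = conv(DemandP(i,:), fliplr(ReturnP(i,:)));
end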