Mnist dataset pattern recognition accuracy - matlab

I'm completely new to MATLAB and this is my first project. MNIST has 60000 images of the digits 0 to 9 for training and 10000 images for testing. I tried to build a pattern for each of the 10 classes (0 to 9) by taking the mean of its training images; for recognition I then use the Euclidean distance to each pattern. This is very simple, but the accuracy is really low.
I don't know exactly where the problem is that gives me this accuracy: 1.73%.
Here is my code.
Finding the 10 patterns, one per class:
root = 'F:\matlab\ex1\exercise-EquivaliencOfL2DistanceAndDotProduct\dataset';
fn = strcat (root, '\MnistTrainX.mat');
load (fn);
fn = strcat (root, '\MnistTrainY.mat');
load (fn);
weights = zeros (10, 784);
b = zeros (10, 1);
im=reshape(MnistTrainX(5,:),[28 ,28]);
imshow(im,[]);
imshow(im',[]);
for c=1 : 10
idx=find(MnistTrainY == c-1);
weights (c,:)=mean( MnistTrainX(idx,:));
end
trainAccuray = ComputeInnerProductAccuracy(weights,b, MnistTrainX,MnistTrainY);
display(trainAccuray);
fn = strcat (root, '\MnistTestX.mat');
load (fn);
fn = strcat (root, '\MnistTestY.mat');
load (fn);
testAccuray = ComputeInnerProductAccuracy(weights, b, MnistTestX, MnistTestY);
display(testAccuray);
And this is the accuracy function:
function [acc]=ComputeInnerProductAccuracy(weights, b, X, Y)
n = size(X, 1);
minmat = zeros (60000, 2);
endmat = zeros (60000, 10);
m = size(X);
a=0;
for i=1 : n
for j=1 : 10
endmat(i,j)=sum((X(i,:)-(weights(j,:))).^2,2);
end
[minmat(i,1) ,minmat(i,2)]= min(endmat(i,:));
if minmat(i,2)== Y(i)
a=a+1;
end
end
acc=(a*100)/60000;
end

Your code is mostly correct, though it's quite inefficient. I won't spend time making it more efficient, as there are many areas that would need addressing. Instead I'll focus on what is wrong. There are two problems with the code. The first is where you find which digit has the lowest distance:
[minmat(i,1) ,minmat(i,2)]= min(endmat(i,:));
Note that the second output of min produces the location of where the minimum is starting at index 1. The class values in Y should contain 0 to 9 but the output index of min in your case is from 1 to 10. The output minimum indices and the corresponding class values are 1 off from each other, which is probably the reason why you have such bad accuracy.
Therefore, you have to subtract 1 from minmat(i, 2) before you check to see if the minimum label is indeed the ground truth... or you can simply add 1 to Y(i) when checking:
[minmat(i,1) ,minmat(i,2)]= min(endmat(i,:));
if minmat(i,2)== Y(i)+1 % Change
a=a+1;
end
The second problem is that the "inner product" function (you're actually computing the Euclidean distance, but let's put that aside for this answer) assumes that there are always 60000 inputs, yet your test set doesn't have that many. This works fine on your training data, but it reports the wrong accuracy for your test data. Make sure you change all instances of 60000 in the function to n, the variable you've already created that holds the number of inputs.
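For reference, here is a minimal sketch of both fixes applied in a vectorized form. It is not your original function: the tiny two-class data set and the variable names (classes, d2, predicted) are made up for illustration, and the squared distance is expanded as ||x||^2 + ||w||^2 - 2*x*w' so that no loop over samples is needed.

```matlab
% Synthetic stand-in for the MNIST matrices (hypothetical data, 2 classes)
rng(0);
X = [zeros(3,4); ones(3,4)] + 0.01*randn(6,4);   % n-by-d samples
Y = [0; 0; 0; 1; 1; 1];                          % labels start at 0, as in MNIST

% One mean pattern per class
classes = unique(Y);
weights = zeros(numel(classes), size(X,2));
for c = 1:numel(classes)
    weights(c,:) = mean(X(Y == classes(c), :), 1);
end

% Squared Euclidean distance from every sample to every pattern (n-by-numClasses)
d2 = bsxfun(@plus, sum(X.^2,2), sum(weights.^2,2).') - 2*(X*weights.');

% min returns a 1-based column index; map it back to the 0-based label
[~, idx] = min(d2, [], 2);
predicted = classes(idx);

% mean() divides by the actual number of samples, not a hard-coded 60000
acc = 100 * mean(predicted == Y);
```

With labels 0 to 9, classes(idx) is the same as idx-1, which is the off-by-one correction described above.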

Related

Minimize difference between indicator variables in Matlab

I'm new to Matlab and want to write a program that chooses the value of a parameter (P) to minimize the difference between two vectors, where each vector is a variable in a dataframe. The first vector (call it A) is a predetermined vector of 1s and 0s, and the second vector (call it B) has each of its entries determined as an indicator function that depends on the value of the parameter P and other variables in the dataframe. For instance, let C be a third variable in the dataset, so
A = [1, 0, 0, 1, 0]
B = [x, y, z, u, v]
where x = 1 if (C[1]+10)^0.5 - P > (C[1])^0.5 and otherwise x = 0, and similarly, y = 1 if (C[2]+10)^0.5 - P > (C[2])^0.5 and otherwise y = 0, and so on.
I'm not really sure where to start with the code, except that it might be useful to use the fminsearch command. Any suggestions?
Edit: I changed the above by raising to a power, which is closer to the actual example that I have. I'm also providing a complete example in response to a comment:
Let A be as above, and let C = [10, 1, 100, 1000, 1]. Then my goal with the Matlab code would be to choose a value of P to minimize the differences between the coordinates of the vectors A and B, where B[1] = 1 if (10+10)^0.5 - P > (10)^0.5 and otherwise B[1] = 0, and similarly B[2] = 1 if (1+10)^0.5 - P > (1)^0.5 and otherwise B[2] = 0, etc. So I want to choose P to maximize the likelihood that A[1] = B[1], A[2] = B[2], etc.
I have the following setup in Matlab, where ds is the name of my dataset:
ds.B = zeros(size(ds,1),1); % empty vector to fill
for i = 1:size(ds,1)
if ((ds.C(i) + 10)^(0.5) - P > (ds.C(i))^(0.5))
ds.B(i) = 1;
else
ds.B(i) = 0;
end
end
Now I want to choose the value of P to minimize the difference between A and B. How can I do this?
EDIT: I'm also wondering how to do this when the inequality is something like (C[i]+10)^0.5 - P*D[i] > (C[i])^0.5, where D is another variable in my dataset. Now P is a scalar being multiplied rather than just added. This seems more complicated since I can't solve for P exactly. How can I solve the problem in this case?
EDIT 1: It seems fminbnd() isn't optimal, likely due to the stairstep nature of the indicator function. I've updated to test the midpoints of all the regions between indicator function flips, plus endpoints.
EDIT 2: Updated to include dataset D as a coefficient of P.
If you can package your distance calculation up in a single function based on P, you can then search for its minimum.
arraySize = 1000;
ds.A = double(rand([arraySize,1]) > 0.5);
ds.C = rand(size(ds.A));
ds.D = rand(size(ds.A));
B = @(P)double((ds.C+10).^0.5 - P.*ds.D > ds.C.^0.5);
costFcn = @(P)sqrt(sum((ds.A-B(P)).^2));
% Solving the equation (C+10)^0.5 - P*D = C^0.5 for P, and sorting the results
BCrossingPoints = sort(((ds.C+10).^0.5-ds.C.^0.5)./ds.D);
% Taking the average of each crossing point with its neighbors
BMidpoints = (BCrossingPoints(1:end-1)+BCrossingPoints(2:end))/2;
% Appending endpoints onto the midpoints
PsToTest = [BCrossingPoints(1)-0.1; BMidpoints; BCrossingPoints(end)+0.1];
% Calculate the distance from A to B at each P to test
costResult = arrayfun(costFcn,PsToTest);
% Find the minimum cost
[~,lowestCostIndex] = min(costResult);
% Find the optimum P
optimumP = PsToTest(lowestCostIndex);
ds.B = B(optimumP);
semilogx(PsToTest,costResult)
xlabel('P')
ylabel('Distance from A to B')
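As a small worked check, the same midpoint search can be run on the five-element example from the question (without D). This is only a sketch; the names t, candidates and mismatches are introduced here for illustration. Since B(i) = 1 exactly when P is below sqrt(C(i)+10) - sqrt(C(i)), it is enough to test one P inside each interval between those thresholds, plus one below and one above all of them.

```matlab
A = [1 0 0 1 0];
C = [10 1 100 1000 1];

% B(i) flips from 1 to 0 when P crosses this threshold
t = sqrt(C + 10) - sqrt(C);

% One candidate P per interval between distinct thresholds, plus both ends
edges = unique(t);
candidates = [edges(1) - 1, (edges(1:end-1) + edges(2:end))/2, edges(end) + 1];

% Number of positions where B(P) disagrees with A, for each candidate
mismatches = arrayfun(@(P) sum((t > P) ~= A), candidates);

[bestCount, k] = min(mismatches);
bestP = candidates(k);
```

Because the cost is piecewise constant, any P in the same interval as bestP gives the same number of mismatches.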
1.- x is assumed to be real and positive only, because x<0 produces complex values.
Since the question says nothing about this, it seems reasonable to assume x real and x>0.
As requested, the parameter P is a scalar. P appears to have only 2 significant states, P>0 or P<0; let's see why:
2.- The following lines generate random A and C.
Then a sweep of p is carried out and the distances d1 and d2 are calculated.
d1 is the Euclidean distance and d2 is the absolute difference between A and B after converting both from binary to decimal:
N=10
% A=[1 0 0 1 0]
A=randi([0 1],1,N);
% C=[10 1 1e2 1e3 1]
C=randi([0 1e3],1,N)
p=[-1e4:1:1e4]; % parameter to optimize
B=zeros(1,numel(A));
d1=zeros(1,numel(p)); % euclidean distance
d2=zeros(1,numel(p)); % difference distance
for k1=1:1:numel(p)
B=(C+10).^.5-p(k1)>C.^.5;
d1(k1)=(sum((B-A).^2))^.5;
d2(k1)=abs(sum(A.*2.^[numel(A)-1:-1:0])-sum(B.*2.^[numel(A)-1:-1:0]));
end
figure;
plot(p,d1)
grid on
xlabel('p');title('d1')
figure
plot(p,d2)
grid on
xlabel('p');title('d2')
The only degree of freedom to optimise seems to be the sign of P regardless of |P| value.
3.- f(p,x) has either no root, or just one root, depending upon p
The threshold function is
if f(x)>0 then B(k)==1 else B(k)==0
this is
f(p,x)=(x+10)^.5-p-x^.5
Now
(x+10).^.5-p>x.^.5 is same as (x+10).^.5-x.^.5>p
There's a range of p that keeps f(p,x)=0 without any (real) root.
For the particular case p=0 then (x+10).^.5 and x.^.5 do not intersect (until Inf reached = there's no intersection)
figure;plot(x,(x+10).^.5,x,x.^.5);grid on
y2=diff((x+10).^.5-x.^.5)
figure;plot(x(2:end),y2);
grid on;xlabel('x')
title('y2=diff((x+10).^.5-x.^.5)')
This means the condition f(x)>0 is always true holding all bits of B=1. With B=1 then d(A,B) turns into d(A,1), a constant.
However, for a certain value of p then there's one root and f(x)>0 is always false keeping all bits of B=0.
In this case d(A,B) the cost function turns into d(A,0) and this is A itself.
4.- P as a vector
The optimization gains in degrees of freedom if instead of P scalar, P is considered as vector.
For a given x there's a value of p that switches B(k) from 0 to 1.
Any value of p below such threshold keeps B(k)=0.
Equivalently, inverting f(x) :
g(p)=(10-p^2)^2/(4*p^2)>x
Values of x below this threshold bring B closer to A because for each element of B it's flipped to the element value of A.
Therefore, it's convenient to consider P as a vector, not a scalar, and:
For all, or as many (as possible) elements of C to meet c(k)<(10-p^2)^2/(4*p^2) in order to get C=A or
minimize d(A,C)
5.- roots of f(p,x)
syms t positive
p=[-1000:.1:1000];
zp=NaN*ones(1,numel(p));
sol=zeros(1,numel(p));
for k1=1:1:numel(p)
p(k1)
eq1=(t+10)^.5-p(k1)-t^.5-p(k1)==0;
s1=solve(eq1,t);
if ~isempty(s1)
zp(k1)=s1;
end
end
nzp=~isnan(zp);
zp(nzp)
returns
ans =
620.0100 151.2900 64.5344 34.2225 20.2500 12.7211
8.2451 5.4056 3.5260 2.2500 1.3753 0.7803
0.3882 0.1488 0.0278

Repeated option pricing with Sobol Sequence (Matlab)

Trying to calculate the variance of a European option using repeated trial (instead of 1 trial). I want to compare the variance using the standard randn function and the sobolset. I'm not quite sure how to draw repeated samples from the latter.
Generating from randn is easy:
num_steps = 100;
num_paths = 10;
z = randn(num_steps, num_paths); % 100 steps per path, 10 paths
Once I have this, I can loop through all 10 columns of the z matrix, and can also repeat the experiment many times, as the randn function will provide a new set of random variables every time.
for exp_num = 1: 20
for col = 1: 10
price_vec = z(:, col);
end
end
I'm not quite sure how to do this with the sobolset. I understand I can create a matrix of dimensions to start with (say 100* 10). I can loop through as above through all the columns for the first experiment. However, when I try the next experiment (#2), the loop starts from the beginning and all the numbers are the same. Meaning I don't get any variation in my pricing. It seems I will need to find a way to randomize the column selection at the start of every experiment number. Is there a better way to do this??
data1 = sobolset(1000, 'Skip', 1000, 'Leap', 100)
data2 = net(data1, 10)
for exp_num = 1: 20
% how do I change the start of the column selection here, so that the next data3 is different from %the one in the previous exp_num?
for col = 1:10
data3(:, col) = data2(:, col)
% perform calculations
end
end
I hope this is making sense....
Thanks for the help!
Update: 8/21
I tried the following:
num_runs = 100
num_samples = 1000
for j = 1: num_runs
for i = 1 : num_samples
sobol_set = sobolset(num_samples,'Skip',j*50,'Leap',1e2);
sobol_set = net(sobol_set, 5);
sobol_seq = sobol_set(:, i)';
z_uncorr = norminv(sobol_seq, 0, 1)
% do pricing with z_uncorr through some function F
end
end
After generating 100 prices (through some function F, mentioned above), I find that the variance of the 100 prices is higher than that I get from the standard pseudo random numbers. This should not be the case. I think I'm still not sampling correctly from the sobolset. Any advice would be appreciated.

Verify Law of Large Numbers in MATLAB

The problem:
If a large number of fair N-sided dice are rolled, the average of the simulated rolls is likely to be close to the mean of 1,2,...N i.e. the expected value of one die. For example, the expected value of a 6-sided die is 3.5.
Given N, simulate 1e8 N-sided dice rolls by creating a vector of 1e8 uniformly distributed random integers. Return the difference between the mean of this vector and the mean of integers from 1 to N.
My code:
function dice_diff = loln(N)
% the mean of integer from 1 to N
A = 1:N
meanN = sum(A)/N;
% I do not have any idea what I am doing here!
V = randi(1e8);
meanvector = V/1e8;
dice_diff = meanvector - meanN;
end
First of all, make sure every time you ask a question that it is as clear as possible, to make it easier for other users to read.
If you check how randi works, you can see this:
R = randi(IMAX,N) returns an N-by-N matrix containing pseudorandom
integer values drawn from the discrete uniform distribution on 1:IMAX.
randi(IMAX,M,N) or randi(IMAX,[M,N]) returns an M-by-N matrix.
randi(IMAX,M,N,P,...) or randi(IMAX,[M,N,P,...]) returns an
M-by-N-by-P-by-... array. randi(IMAX) returns a scalar.
randi(IMAX,SIZE(A)) returns an array the same size as A.
So, if you want to use randi in your problem, you have to use it like this:
V=randi(N, 1e8,1);
and you need some more changes:
function dice_diff = loln(N)
%the mean of integer from 1 to N
A = 1:N;
meanN = mean(A);
V = randi(N, 1e8,1);
meanvector = mean(V);
dice_diff = meanvector - meanN;
end
For future problems, try using the command
help randi
And matlab will explain how the function randi (or other function) works.
Make sure to check if the code above gives the desired result
As pointed out, take a closer look at the use of randi(). From the general case
X = randi([LowerInt,UpperInt],NumRows,NumColumns); % UpperInt > LowerInt
you can adapt to dice rolling by
Rolls = randi([1 NumSides],NumRolls,NumSamplePaths);
as an example. Exchanging NumRolls and NumSamplePaths yields an array with the shape of Rolls.' (that is, transpose(Rolls)), with one sample path per row.
According to the Law of Large Numbers, the updated sample average after each roll should converge to the true mean, ExpVal (short for expected value), as the number of rolls (trials) increases. Notice that as NumRolls gets larger, the sample mean converges to the true mean. The image below shows this for two sample paths.
To get the sample mean for each number of dice rolls, I used arrayfun() with
CumulativeAvg1 = arrayfun(@(jj)mean(Rolls(1:jj,1)),[1:NumRolls]);
which is equivalent to using the cumulative sum, cumsum(), to get the same result.
CumulativeAvg1 = (cumsum(Rolls(:,1))./(1:NumRolls).'); % equivalent
% MATLAB R2019a
% Create Dice
NumSides = 6; % positive nonzero integer
NumRolls = 200;
NumSamplePaths = 2;
% Roll Dice
Rolls = randi([1 NumSides],NumRolls,NumSamplePaths);
% Output Statistics
ExpVal = mean(1:NumSides);
CumulativeAvg1 = arrayfun(@(jj)mean(Rolls(1:jj,1)),[1:NumRolls]);
CumulativeAvgError1 = CumulativeAvg1 - ExpVal;
CumulativeAvg2 = arrayfun(@(jj)mean(Rolls(1:jj,2)),[1:NumRolls]);
CumulativeAvgError2 = CumulativeAvg2 - ExpVal;
% Plot
figure
subplot(2,1,1), hold on, box on
plot(1:NumRolls,CumulativeAvg1,'b--','LineWidth',1.5,'DisplayName','Sample Path 1')
plot(1:NumRolls,CumulativeAvg2,'r--','LineWidth',1.5,'DisplayName','Sample Path 2')
yline(ExpVal,'k-')
title('Average')
xlabel('Number of Trials')
ylim([1 NumSides])
subplot(2,1,2), hold on, box on
plot(1:NumRolls,CumulativeAvgError1,'b--','LineWidth',1.5,'DisplayName','Sample Path 1')
plot(1:NumRolls,CumulativeAvgError2,'r--','LineWidth',1.5,'DisplayName','Sample Path 2')
yline(0,'k-')
title('Error')
xlabel('Number of Trials')
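The cumsum() equivalence mentioned above can be checked for all sample paths at once. The following is a sketch with a fixed seed so that it is repeatable; CumAvg, loopAvg and maxGap are names introduced here, and bsxfun is used so that it also runs on releases without implicit expansion.

```matlab
rng(1);
NumSides = 6;
NumRolls = 200;
NumSamplePaths = 2;
Rolls = randi([1 NumSides], NumRolls, NumSamplePaths);

% Running average down each column: cumulative sum divided by the roll count
CumAvg = bsxfun(@rdivide, cumsum(Rolls, 1), (1:NumRolls).');

% Same quantity for the first path, computed with the arrayfun approach
loopAvg = arrayfun(@(jj) mean(Rolls(1:jj, 1)), 1:NumRolls).';

% Largest disagreement between the two methods (rounding error only)
maxGap = max(abs(CumAvg(:,1) - loopAvg));
```

The cumsum version does one pass over the data, whereas the arrayfun version recomputes a mean over a growing prefix for every roll, so the difference matters once NumRolls gets large.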

Dice simulation with matlab

I am new to this forum. First of all, I find it very useful to have a website where everyone can get help in different domains. Thank you very much.
So I have a problem. I am supposed to solve the following exercise:
Simulate with rand ntrials of rolling a dice.
if rand() in [0, 1/6] then 1 was thrown;
if rand() in (1/6, 2/6] then 2 was thrown
...
if rand() in (5/6, 1] then 6 was thrown.
Generate with hist a histogram of the results of the ntrials.
This is what I did:
ntrials = 100;
X = abs(rand(1,ntrials)*6) + 1;
hist(floor(X))
Now there is a second exercise that I must do:
two dice are thrown and S is the sum of the 2 dice
Compute the probability that S takes each of the values 2, 3, 4, 5, ..., 12.
Write a Matlab function twoTimesDice that checks the theoretical result through a simulation of the throw of 2 dice, as in the first exercise.
This is what I tried:
function twoTimesDice
x1 = abs(rand(1,11))*6 + 1;
s1 = floor(x1); % possible result of the first die
x2 = abs(rand(1,11))*6 +1;
s2 = floor(x2); % possible result of the second die
S = s1 +s2;
hist(S);
end
Can you tell me please if I did it well?
Generating a dice roll between 1 and 6 can be done by randi().
So first, use randi() instead of floor() and abs():
X = randi(6,1,ntrials)
which will give you an array of length ntrials with random integers ranging from 1 to 6 (you need the 1 there, or it will return a square matrix of size ntrials by ntrials). See the randi documentation.
In the function my personal preference would be to request the number of trials as input.
Your function then becomes:
function twoTimesDice(ntrials)
s1 = randi(6,1,ntrials); % result of the first dice
s2 = randi(6,1,ntrials); % result of the second dice
S = s1 +s2;
hist(S);
end
For a normalised histogram, you can replace hist(S) by:
numOfBins = 11;
[histFreq, histXout] = hist(S, numOfBins);
figure;
bar(histXout, histFreq/sum(histFreq)*100);
xlabel('Value');ylabel('Percentage');
(As described in this question)
For the first part, I would use floor instead of abs,
X = floor(rand(1, ntrials)*6) + 1;
as it returns the values you are looking for, or as Daniel commented, use
randi(6)
which returns an integer.
Then you can just run
hist(X,6)
For the second part, I believe they are asking for two dice rolls, each being 1-6, and not one 2-12.
x = floor(rand(1)*6) + 1;
The distribution will look different. Roll those twice, add the result, that is your twoTimesDice function.
Roll that ntrials times, then do a histogram of that (as you already do).
I am not sure how random rand() really is though.
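To compare the simulation with the theoretical probabilities the exercise asks for, something like the following sketch could be used. The number of ways to obtain a sum s with two fair dice is 6 - |s - 7|, out of 36 equally likely outcomes. The names simProb and theoProb and the fixed seed are assumptions made for this example.

```matlab
rng(42);
ntrials = 1e5;

% Sum of two independent dice, one row per trial
S = randi(6, ntrials, 1) + randi(6, ntrials, 1);

% Relative frequency of each sum; index 1 corresponds to S = 2, index 11 to S = 12
simProb = accumarray(S - 1, 1, [11 1]) / ntrials;

% Theoretical probabilities for S = 2..12
theoProb = (6 - abs((2:12).' - 7)) / 36;

% Side-by-side bar chart of simulated and theoretical values
bar(2:12, [simProb theoProb]);
xlabel('S'); ylabel('Probability');
legend('Simulated', 'Theoretical');
```

Counting with accumarray keeps one bin per integer sum, which avoids the bin-placement question that comes up with hist.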

How can I speed up this call to quantile in Matlab?

I have a MATLAB routine with one rather obvious bottleneck. I've profiled the function, with the result that 2/3 of the computing time is used in the function levels:
The function levels takes a matrix of floats and splits each column into nLevels buckets, returning a matrix of the same size as the input, with each entry replaced by the number of the bucket it falls into.
To do this I use the quantile function to get the bucket limits, and a loop to assign the entries to buckets. Here's my implementation:
function [Y q] = levels(X,nLevels)
% "Assign each of the elements of X to an integer-valued level"
p = linspace(0, 1.0, nLevels+1);
q = quantile(X,p);
if isvector(q)
q=transpose(q);
end
Y = zeros(size(X));
for i = 1:nLevels
% "The variables g and l indicate the entries that are respectively greater than
% or less than the relevant bucket limits. The line Y(g & l) = i is assigning the
% value i to any element that falls in this bucket."
if i ~= nLevels % "The default; doesnt include upper bound"
g = bsxfun(@ge,X,q(i,:));
l = bsxfun(@lt,X,q(i+1,:));
else % "For the final level we include the upper bound"
g = bsxfun(@ge,X,q(i,:));
l = bsxfun(@le,X,q(i+1,:));
end
Y(g & l) = i;
end
Is there anything I can do to speed this up? Can the code be vectorized?
If I understand correctly, you want to know how many items fell in each bucket.
Use:
n = hist(Y,nbins)
Though I am not sure that it will help in the speedup. It is just cleaner this way.
Edit : Following the comment:
You can use the second output parameter of histc
[n,bin] = histc(...) also returns an index matrix bin. If x is a vector, n(k) = sum(bin==k). bin is zero for out of range values. If x is an M-by-N matrix, then
How about this:
function [Y q] = levels(X,nLevels)
p = linspace(0, 1.0, nLevels+1);
q = quantile(X,p);
Y = zeros(size(X));
for i = 1:numel(q)-1
Y = Y + (X>=q(i)); % parentheses needed: + binds tighter than >=
end
This results in the following:
>>X = [3 1 4 6 7 2];
>>[Y, q] = levels(X,2)
Y =
1 1 2 2 2 1
q =
1 3.5 7
You could also modify the logic line to ensure values are less than the start of the next bin. However, I don't think it is necessary.
I think you should use histc
[~,Y] = histc(X,q)
As you can see in matlab's doc:
Description
n = histc(x,edges) counts the number of values in vector x that fall
between the elements in the edges vector (which must contain
monotonically nondecreasing values). n is a length(edges) vector
containing these counts. No elements of x can be complex.
I made a couple of refinements (including one inspired by Aero Engy in another answer) that have resulted in some improvements. To test them out, I created a random matrix of a million rows and 100 columns to run the improved functions on:
>> x = randn(1000000,100);
First, I ran my unmodified code, with the following results:
Note that of the 40 seconds, around 14 of them are spent computing the quantiles - I can't expect to improve this part of the routine (I assume that Mathworks have already optimized it, though I guess that to assume makes an...)
Next, I modified the routine to the following, which should be faster and has the advantage of being fewer lines as well!
function [Y q] = levels(X,nLevels)
p = linspace(0, 1.0, nLevels+1);
q = quantile(X,p);
if isvector(q), q = transpose(q); end
Y = ones(size(X));
for i = 2:nLevels
Y = Y + bsxfun(@ge,X,q(i,:));
end
The profiling results with this code are:
So it is 15 seconds faster, which represents a 150% speedup of the portion of code that is mine, rather than MathWorks.
Finally, following a suggestion of Andrey (again in another answer) I modified the code to use the second output of the histc function, which assigns entries to bins. It doesn't treat the columns independently, so I had to loop over the columns manually, but it seems to be performing really well. Here's the code:
function [Y q] = levels(X,nLevels)
p = linspace(0,1,nLevels+1);
q = quantile(X,p);
if isvector(q), q = transpose(q); end
q(end,:) = 2 * q(end,:);
Y = zeros(size(X));
for k = 1:size(X,2)
[junk Y(:,k)] = histc(X(:,k),q(:,k));
end
And the profiling results:
We now spend only 4.3 seconds in code outside the quantile function, which is around a 500% speedup over what I wrote originally. I've spent a bit of time writing this answer because I think it's turned into a nice example of how you can use the MATLAB profiler and StackExchange in combination to get much better performance from your code.
I'm happy with this result, although of course I'll continue to be pleased to hear other answers. At this stage the main performance increase will come from increasing the performance of the part of the code that currently calls quantile. I can't see how to do this immediately, but maybe someone else here can. Thanks again!
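One caveat on the histc version: doubling the last edge only moves it above the maximum when that maximum is positive, so it may misbehave on columns whose largest value is zero or negative. A possible alternative, assuming MATLAB R2015a or later, is discretize, whose last bin includes its right edge by default. The sketch below reuses the small example from the earlier answer, with the edges hard-coded to the quantiles shown there rather than computed with quantile.

```matlab
X = [3 1 4 6 7 2].';
edges = [1 3.5 7];       % quantile(X, linspace(0,1,3)) as reported above

% Bin k covers [edges(k), edges(k+1)), except the last bin, which is closed on both sides
Y = discretize(X, edges);
```

For a matrix, the same per-column loop as in the histc version would apply, with discretize(X(:,k), q(:,k)) in place of the histc call. I have not profiled this against histc.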
You can sort the columns and divide+round the inverse indexes:
function Y = levels(X,nLevels)
% "Assign each of the elements of X to an integer-valued level"
[S,IX]=sort(X);
[grid1,grid2]=ndgrid(1:size(IX,1),1:size(IX,2));
invIX=zeros(size(X));
invIX(sub2ind(size(X),IX(:),grid2(:)))=grid1;
Y=ceil(invIX/size(X,1)*nLevels);
Or you can use tiedrank:
function Y = levels(X,nLevels)
% "Assign each of the elements of X to an integer-valued level"
R=tiedrank(X);
Y=ceil(R/size(X,1)*nLevels);
Surprisingly, both these solutions are slightly slower than the quantile+histc solution.