How can I find the difference between two plots with a dimensional mismatch? - matlab

I have a question that I don't know if there is a solution off the bat.
Here it goes,
I have two data sets plotted on the same figure, and I need to find their difference. Simple so far...
The problem is that matrix A has 1000 data points while matrix B has only 580. How can I find the difference between the two graphs when the two data sets have mismatched dimensions?
One idea is to artificially inflate matrix B to 1000 data points while keeping the trend of the plot the same. Is this possible, and if so, how?
For example:
A=[1 45 33 4 1009 ];
B=[1 22 33 44 55 66 77 88 99 1010];
Ya=A.*20+4;
Yb=B./10+3;
C=abs(B - A)
plot(A,Ya,'r',B,Yb)
xlim([-100 1000])
grid on
hold on
plot(length(B),C)

One way to do it is to resample the 580-element vector to 1000 samples. Use MATLAB's resample (requires the Signal Processing Toolbox, I believe) for this:
x = randn(580,1);
y = randn(1000,1);
xr = resample(x, 50, 29); % 50/29 = 1000/580 is the resampling ratio
You should then be able to compare the two data vectors.

There are two ways that I can think of:
1- Matching the size:
Generating more data for the matrix with the lower number of elements (using interpolation, etc.)
Removing some data from the matrix with the higher number of elements (e.g. outlier removal)
2- Comparing the matrices by their properties.
For instance, you can calculate the mean and the covariance of one matrix and compare them with those of the other. Useful functions include cov, mean, median, std, var, xcorr and xcov.
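The interpolation option can be sketched with interp1, which is part of base MATLAB. This is only a minimal sketch: the names xa, ya, xb, yb and the sine/cosine data are hypothetical stand-ins, and it assumes both data sets come with their own x-values covering the same range.

```matlab
% Hypothetical data: two curves sampled on different x-grids
xa = linspace(0, 10, 1000);   % 1000 points (data set A)
ya = sin(xa);
xb = linspace(0, 10, 580);    % 580 points (data set B)
yb = cos(xb);

% Interpolate the shorter series onto the x-grid of the longer one
yb_on_a = interp1(xb, yb, xa, 'linear');

% Point-wise absolute difference, now defined at all 1000 x-values
d = abs(ya - yb_on_a);
```

The difference can then be drawn with plot(xa, d). If the two x-ranges do not fully overlap, interp1 returns NaN outside the range of xb, so the comparison is only meaningful on the overlap.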

Related

How to quickly/easily merge and average data in matrix in MATLAB?

I have a matrix of air-fuel ratio (AFR) values at certain engine speeds and throttle positions (e.g. the AFR is 14 at 2500 rpm and 60% throttle).
The matrix is currently 25x10: the engine speed ranges from 1200 to 6000 rpm in steps of 200 rpm, and the throttle ranges from 0.1 to 1 in steps of 0.1.
Say I have measured new values, e.g. an AFR of 13.5 at 2138 rpm and 74.3% throttle. How do I merge that into the matrix? The closest matrix values are at 2000 or 2200 rpm and 70 or 80% throttle. I also don't want new data to replace the older data. How can I make the matrix take this value in and adjust its entries to take the new value into account?
Simplified i have the following x-axis values(top row) and 1x4 matrix(below):
2 4 6 8
14 16 18 20
I just measured an AFR value of 15.5 at 3 rpm. Interpolating the AFR matrix would have given 15, so this value is out of the ordinary.
I want the matrix to take this data and adjust the other values to it, i.e. average everything so that the more data I put in, the more reliable and accurate the matrix becomes. So in the simplified case the matrix would become something like:
2 4 6 8
14.3 16.3 18.2 20.1
So it averages between old and new data. I've read the documentation about concatenation, but I believe my problem can't be solved with that function.
EDIT: To clarify my question, here is a visual illustration.
The 'matrix' keeps the same size of 5 points while a new data point is added. It takes the new data into account and adjusts the matrix accordingly. This is what I'm trying to achieve. The more scattered data I get, the more accurate the matrix becomes. (And yes, the green dot in this case would be an outlier, but it explains my case.)
Cheers
This is not a matter of simple merge/average. I don't think there's a quick method to do this unless you have simplifying assumptions. What you want is a statistical inference of the underlying trend. I suggest using Gaussian process regression to solve this problem. There's a great MATLAB toolbox by Rasmussen and Williams called GPML. http://www.gaussianprocess.org/gpml/
This sounds more like a data fitting task to me. What you are suggesting is that you have a set of measurements for which you wish to get the best linear fit. Instead of maintaining a table of averaged data, keep a table of all measured values and then find the best fit to those values. So, for example, I could create a matrix, A, which has all of the recorded values. Let's start with:
A=[2,14;3,15.5;4,16;6,18;8,20];
I now need a matrix of basis values for the inputs to my fitting curve (which, in this instance, let's assume is linear, so the basis is the set of values 1 and x):
B=[ones(size(A,1),1), A(:,1)];
We can find the linear fit parameters (where it cuts the y-axis and the gradient) using:
B\A(:,2)
Or, if you want the points that the line goes through for the values of x:
B*(B\A(:,2))
This results in the points:
2, 14.1897
3, 15.1552
4, 16.1207
6, 18.0517
8, 19.9828
which represents the best fit line through these points.
You can manually extend this to polynomial fitting if you want, or you can use the Matlab function polyfit. To manually extend the process you should use a revised B matrix. You can also produce only a specified set of points in the last line. The complete code would then be:
% Original measurements - could be read in from a file,
% but for this example we will set it to a matrix
% Note that not all tabulated values need to be present
A=[2,14; 3,15.5; 4,16; 5,17; 8,20];
% Now create the polynomial values of x corresponding to
% the data points. Choosing a second order polynomial...
B=[ones(size(A,1),1), A(:,1), A(:,1).^2];
% Find the polynomial coefficients for the best fit curve
coeffs=B\A(:,2);
% Now generate a table of values at specific points
% First define the x-values
tabinds = 2:2:8;
% Then generate the polynomial values of x
tabpolys=[ones(length(tabinds),1), tabinds', (tabinds').^2];
% Finally, multiply by the coefficients found
curve_table = [tabinds', tabpolys*coeffs];
% and display the results
disp(curve_table);
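For comparison, the same kind of fit can be sketched with polyfit and polyval, which the answer mentions as the built-in alternative. This is a minimal sketch that reuses the five points of the first (linear) example; the variable names p and tabvals are just illustrative.

```matlab
% Measurements from the first example: x-values and AFR values
x = [2 3 4 6 8];
y = [14 15.5 16 18 20];

% First-order least-squares fit: p(1) is the slope, p(2) the intercept
p = polyfit(x, y, 1);

% Evaluate the fitted line at the tabulated x-values
tabinds = 2:2:8;
tabvals = polyval(p, tabinds);

disp([tabinds', tabvals']);
```

Changing the third argument of polyfit to 2 gives the second-order fit that the manual B matrix above builds by hand.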

Randomly select Elements of 4D matrix in Matlab

I have a 4D matrix with dimensions 7x4x24x10 (let's call it main_mat). I want to get a matrix of size 7x4x24 (rand_mat) such that each element of rand_mat is a uniformly random draw from the main matrix (main_mat). I am sorry if I have not put the question clearly, so let me try to explain:
I have a stack of 24 sheets of 7x4 elements, and I have 10 such stacks. I want a single stack of 24 sheets of 7x4 elements in which every element is drawn uniformly at random from the same position on the same sheet number across the 10 stacks. How can I do it without using loops?
If I am interpreting what you want correctly, for each unique 3D position in this matrix of 7 x 4 x 24, you want to be sure that we randomly sample from one out of the 10 stacks that share the same 3D spatial position.
What I would recommend is to generate a 7 x 4 x 24 array of random integers from 1 to 10, then use sub2ind along with ndgrid. You can certainly use randi as you have alluded to in the comments.
We'd use ndgrid to generate a grid of 3D coordinates, then use the random integers we generated to access the fourth dimension. Given the fact that your 4D matrix is stored in A, do something like this:
rnd = randi(size(A,4), size(A,1), size(A,2), size(A,3));
[R,C,D] = ndgrid(1:size(A,1), 1:size(A,2), 1:size(A,3));
ind = sub2ind(size(A), R, C, D, rnd);
B = A(ind);
Bear in mind that the above code will work for any 4D matrix. The first line of code generates a 7 x 4 x 24 matrix of random integers in the range [1,10]. Next, we generate a 3D grid of spatial coordinates and then use sub2ind to generate column-major linear indices, so that each unique 3D spatial location of A is sampled from exactly one of the 10 chunks. We then use these indices to sample from A to produce our output matrix B.
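A quick way to convince yourself that the indexing does what is claimed is to check that every output element really comes from the same 3D position in one of the stacks. The following is only a sketch: A is filled with random numbers as a hypothetical stand-in for main_mat, and the names match, n and B2 are illustrative.

```matlab
% Hypothetical stand-in for the 7x4x24x10 main matrix
A = rand(7, 4, 24, 10);

% Same steps as above: one random stack number per 3D position
rnd = randi(size(A,4), size(A,1), size(A,2), size(A,3));
[R, C, D] = ndgrid(1:size(A,1), 1:size(A,2), 1:size(A,3));
B = A(sub2ind(size(A), R, C, D, rnd));

% Check: B(i,j,k) equals A(i,j,k,s) for at least one stack s
match = any(bsxfun(@eq, A, B), 4);

% Equivalent indexing without ndgrid, using column-major arithmetic:
% position p in stack s has linear index p + (s-1)*n
n = size(A,1) * size(A,2) * size(A,3);
B2 = A(reshape(1:n, size(rnd)) + (rnd - 1) * n);
```

Here all(match(:)) should be true and B2 should be identical to B, since both index expressions address the same elements.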
This problem might not be solvable without the use of loops. One way that could work is:
% mainMatrix = ... (7x4x24x10 matrix)
% Preallocate the 7x4x24 result
randMatrix = zeros(size(mainMatrix,1), size(mainMatrix,2), size(mainMatrix,3));
for x = 1:size(mainMatrix,1)
    for y = 1:size(mainMatrix,2)
        for z = 1:size(mainMatrix,3)
            % Pick one of the stacks at random for this position
            randMatrix(x,y,z) = mainMatrix(x,y,z,randi(size(mainMatrix,4)));
        end
    end
end

plotting multivariate data along eigenvectors

I have a data matrix containing 18 samples, each with 12 variables, D(18,12). I performed k-means clustering on the data to get 3 clusters. I want to visualize this data in 2 dimensions, specifically along the 2 eigenvectors corresponding to the largest eigenvalues of a specific matrix, B. So, I create the plane spanned by the two eigenvectors corresponding to the two largest eigenvalues:
[V,EA]=eig(B);
e1=V(:,11);
e2=V(:,12);
for i=1:12
E(i,1)=e1(i);
E(i,2)=e2(i);
end
Eproj=E*E';
where e1 and e2 are the eigenvectors, and E is a matrix containing those column vectors. At this point, I'm kind of stuck.
I recognize that e1 and e2 are orthogonal in this 12-d space, but I have no idea how this can reduce to two dimensions so I can plot it.
I believe that the projection of a data sample onto the plane would be:
Eproj*D(i,:)
for i=1...18, but I'm not sure where to go from here to plot my clusters. When I do the projection, it's still in 12 dimensions.
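Principal Component Analysis can help you transform the data into 2D using the eigenvectors. In the code below, B stands for the data matrix (one sample per row); in newer MATLAB releases, pca replaces princomp.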
coeff = princomp(B);
Bproj = B * coeff(:,1:2);
figure
plot(Bproj(:,1),Bproj(:,2),'*')
If you have the labels you can use the "scatter" function for a better visual. Or you can reduce the dimensionality to 3 and use "scatter3" function.
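If the plane has to be spanned by the eigenvectors of a given matrix rather than by the principal components of the data, the step from 12 dimensions to 2 is to take the coordinates of each sample in the basis (e1, e2), i.e. D*E, instead of the projection E*E' applied to the sample, which still lives in 12-D. A minimal sketch, assuming B is a symmetric 12x12 matrix; the random D, the choice B = cov(D) and the labels idx are hypothetical stand-ins for the real data and the kmeans output.

```matlab
% Hypothetical stand-ins for the real inputs
D = randn(18, 12);           % 18 samples, 12 variables
B = cov(D);                  % some symmetric 12x12 matrix
B = (B + B') / 2;            % enforce exact symmetry
idx = randi(3, 18, 1);       % cluster labels, e.g. from kmeans

% Sort eigenvalues explicitly rather than relying on eig's ordering
[V, EA] = eig(B);
[~, order] = sort(diag(EA), 'descend');
E = V(:, order(1:2));        % 12x2 orthonormal basis of the plane

% 2-D coordinates of every sample within the plane (18x2)
P = D * E;
```

The clusters can then be drawn with scatter(P(:,1), P(:,2), 36, idx, 'filled'), which colours each point by its cluster label.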

How to convert distance into probability?

Can anyone shed some light on my MATLAB program?
I have data from two sensors and I'm doing a kNN classification for each of them separately.
In both cases training set looks like a set of vectors of 42 rows total, like this:
[44 12 53 29 35 30 49;
54 36 58 30 38 24 37;..]
Then I get a sample, e.g. [40 30 50 25 40 25 30] and I want to classify the sample to its closest neighbor.
As the proximity criterion I use the Euclidean metric, sqrt(sum(Y.^2)), where Y is the element-wise difference between the sample and a training vector. This gives me an array of distances between the sample and each class of the training set.
So, two questions:
Is it possible to convert distance into distribution of probabilities, something like: Class1: 60%, Class 2: 30%, Class 3: 5%, Class 5: 1%, etc.
added: Up to this moment I'm using the formula probability = distance / sum of distances, but I cannot plot a correct cdf or histogram.
This gives me a distribution of sorts, but I see a problem: if the distance is large, for example 700, the closest class will still get the biggest probability, which would be wrong because the sample is too far away to be matched with any of the classes.
If I were able to get two probability density functions, I guess I could then take some product of them. Is it possible?
Any help or remark is highly appreciated.
I think there are multiple ways of doing this:
as Adam suggested, use 1/d / sum(1/d)
use the square, or an even higher power, of the inverse distance, e.g. 1/d^2 / sum(1/d^2). This makes the class probability distribution more skewed. For example, if 1/d gives a 40%/60% prediction, 1/d^2 gives roughly 31%/69%.
use softmax (https://en.wikipedia.org/wiki/Softmax_function) on the negative distance, i.e. exp(-d) / sum(exp(-d)).
use exp(-d^2/sigma^2) / sum(exp(-d^2/sigma^2)), which imitates Gaussian likelihoods. Sigma could be the average within-cluster distance, or simply set to 1 for all clusters.
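The options in this list can be compared side by side on a small example. This is a sketch only: the distance vector d and the value sigma = 1 are made-up assumptions, and subtracting the minimum before exponentiating is a standard trick that leaves the normalized result unchanged while avoiding underflow for large distances.

```matlab
% Hypothetical distances from one sample to four classes
d = [2 5 9 20];

% Inverse distance and inverse squared distance, normalized to sum to 1
p_inv  = (1 ./ d)    / sum(1 ./ d);
p_inv2 = (1 ./ d.^2) / sum(1 ./ d.^2);

% Softmax of the negative distance (shifted by min(d) for stability)
w_soft = exp(-(d - min(d)));
p_soft = w_soft / sum(w_soft);

% Gaussian-style weights; sigma = 1 is an assumption
sigma = 1;
w_gauss = exp(-(d.^2 - min(d)^2) / sigma^2);
p_gauss = w_gauss / sum(w_gauss);

disp([p_inv; p_inv2; p_soft; p_gauss]);
```

Each row sums to 1 and gives the nearest class the largest share. Because all of these are relative measures, a sample that is far from every class still gets a confident-looking distribution, so the concern about very large distances needs a separate absolute threshold on d.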
You could try inverting your distances to get a likelihood measure, i.e. the bigger the distance x, the smaller its inverse. Then you can normalize as in probability = (1/distance) / sum(1/distance).
Hi: Have you ever tried with the formula probability = 1-distance assuming that you are using a standardized distance between 0 and 1?

Mahalanobis distance between two vectors

I tried to apply mahal to calculate the Mahalanobis distance between 2 row vectors of 27 variables, i.e. mahal(X, Y), where X and Y are the two vectors. However, it comes up with an error:
The number of rows of X must exceed the number of columns.
After a few minutes of research I understood that I can't use it like this, but I'm still not sure why. Can someone explain it to me?
I also have an example of the mahal function below:
>> mahal([1.55 5 32],[5.76 43 34; 6.7 32 5; 3 3 5; 34 12 6;])
ans =
11.1706
Can someone clarify how MATLAB calculates the answer in this case?
Edit:
I found this code that calculates the Mahalanobis distance:
S = cov(X);
mu = mean(X);
d = (Y-mu)*inv(S)*(Y-mu)'
d = ((Y-mu)/S)*(Y-mu)'; % <-- Mathworks prefers this way
I tested it on [1.55 5 32], and [5.76 43 34; 6.7 32 5; 3 3 5; 34 12 6;] and it gave me the same result as if I used the mahal function (11.1706), and I tried to calculate the distance between the 2 vectors of 27 variables and it works. What do you think about it? Can I count on this solution since the mahal function can't do what I need?
mahal(X,Y)... gave me this error:
"The number of rows of X must exceed the number of columns."
The documentation states that the second input must have more rows than columns (note that the documentation calls the second input X, not the first, so in your call this is Y). For you this means that the second array you feed into mahal must have more rows than columns.
Why is that so important? The purpose of this restriction is to make sure that mahal has enough data to build the covariance matrix used in the computation of the Mahalanobis distance. If there's not enough information, the output would be garbage.
In your case the inputs are two vectors, each having 27 elements. Do the 27 elements correspond to different observations, or are they one observation of 27 variables? If it's the former, just make sure both vectors are column vectors:
mahal(X(:), Y(:))
and you're good to go. If each vector contains only one observation, your estimation of the covariance matrix will be entirely inaccurate. Again, the rows of the inputs should be the observations!
Can someone clarify how MATLAB calculated the answer in this case?
The Mahalanobis distance between two vectors x and y is: $d_M(x, y) = \sqrt{(x-y)^T S^{-1} (x-y)}$, where S is their covariance matrix.
In MATLAB [1], mahal(Y,X) is efficiently implemented in the following manner:
[rx,cx] = size(X); [ry,cy] = size(Y); % rx, ry: number of rows (observations)
m = mean(X,1);
M = m(ones(ry,1),:);
C = X - m(ones(rx,1),:);
[Q,R] = qr(C,0);
ri = R'\(Y-M)';
d = sum(ri.*ri,1)'*(rx-1);
You can verify that with:
type mahal
Note that MATLAB calculates the Mahalanobis distance in squared units, so in your example the Mahalanobis distance is actually the square root of 11.1706, i.e 3.3422.
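The equivalence between the covariance formula from the question and the QR-based computation can be checked without calling mahal itself (which needs the Statistics Toolbox). The sketch below uses the example input from the question, for which the question reports a squared distance of 11.1706; the names d_direct, d_qr and d_root are illustrative.

```matlab
% Example data from the question: 4 observations of 3 variables
X = [5.76 43 34; 6.7 32 5; 3 3 5; 34 12 6];
Y = [1.55 5 32];              % the point to measure

rx = size(X, 1);              % number of observations
m  = mean(X, 1);

% Direct formula with the sample covariance matrix
S = cov(X);
d_direct = ((Y - m) / S) * (Y - m)';

% QR-based route: C'*C = R'*R, and S = C'*C / (rx-1)
C = bsxfun(@minus, X, m);     % centered data
[~, R] = qr(C, 0);
ri = R' \ (Y - m)';
d_qr = sum(ri .* ri, 1)' * (rx - 1);

% Both are squared distances; take the root for the distance itself
d_root = sqrt(d_direct);
```

The two squared distances agree up to rounding. The QR route avoids forming and inverting S explicitly, which is generally the better-conditioned computation.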
Can I count on this [my] solution since the mahal function can't do what I need?
You're doing everything correctly, so it's safe to use. Having said that, note that MATLAB did restrict the dimensions of the second input array for a good reason (stated above).
If X contains only one row, cov automatically converts it to a column vector, which means that each value will be treated as a different observation. The resulting S would be inaccurate (if not garbage).
[1] Checked for MATLAB release version R2007b.