Let us say I have 'n' vectors of data of unequal lengths. All of these vectors are similar (range, etc.; see example below) and can be fitted to a specific probability distribution. How can we average these distributions? Does it make sense? If yes, how do I go about programming it?
For example, for n=2:
for data1 of 400 samples, I get a normal distribution with range 1 to 5, mean 3, and standard deviation 0.75.
for data2 of 500 samples, I get a normal distribution with range 0.95 to 5.2, mean 3.05, and standard deviation 0.78.
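If both vectors really do follow (nearly) the same normal distribution, one hedged sketch is either to pool the raw samples and refit, or to average the fitted parameters weighted by sample size. Here `data1` and `data2` stand in for your actual vectors, and `fitdist` requires the Statistics Toolbox:

```matlab
% Option 1: pool the raw samples and fit one distribution to everything.
data_all = [data1(:); data2(:)];
pd = fitdist(data_all, 'Normal');        % pooled mu and sigma

% Option 2: average the per-vector fits, weighting by sample size.
n1 = numel(data1);  n2 = numel(data2);
w1 = n1/(n1 + n2);  w2 = n2/(n1 + n2);
mu_avg    = w1*mean(data1) + w2*mean(data2);
sigma_avg = sqrt(w1*var(data1) + w2*var(data2));  % weighted average of variances
```

Option 1 is usually preferable when you still have the raw data; Option 2's `sigma_avg` ignores the (small) offset between the two means, which is reasonable here since the means are close.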
I have values for the 10th-25th percentile range (0.49), the 25th-50th percentile range (1.36, which is the peak), the 50th-75th percentile range (0.18), and the >90th percentile range (0.15).
I want to interpolate the values for the >5, 5th-10th, and 75th-90th percentile ranges. How can I do that in MATLAB?
If I assume a normal distribution whose peak is 1.36 (the 25th-50th percentile range, as shown in the attached figure), how can I interpolate the values of the unknown percentile ranges?
Actually, performing an interpolation to find percentile values does not look like a good approach to me. If you are dealing with a normal distribution and its parameters (mu and sigma) are known, what you are looking for is the norminv function (official documentation: https://mathworks.com/help/stats/norminv.html).
X = norminv(P,mu,sigma) computes the inverse of the normal CDF using
the corresponding mean mu and standard deviation sigma at the
corresponding probabilities in P. The parameters in sigma must be
positive, and the values in P must lie in the interval [0 1].
For example, this is how you find the interval that contains 95% of the values of a standard normal distribution:
norminv([0.025 0.975],0,1)
This is how you find the 99th percentile of a normal distribution with mu=10 and sigma=3.5:
norminv(0.99,10,3.5)
If you don't know those parameters, you can estimate them from the data you actually have. The parameters of the normal family are the mean and the standard deviation; once they are known, the underlying distribution is fully described. In fact:
The mean of a normal distribution is halfway between the 25th and the 75th percentile. Average these two values to approximate it.
In a normal distribution, the difference between the 25th and the 75th percentile is about 1.35 times its standard deviation. So take the difference between the aforementioned values and divide it by 1.35 in order to obtain an approximation of the standard deviation.
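Putting those two rules together, a minimal sketch (the quartile values p25 and p75 below are made-up placeholders for your data, and norminv requires the Statistics Toolbox):

```matlab
% Estimate mu and sigma from the 25th and 75th percentiles, then use
% norminv to recover any other percentile of the fitted normal.
p25 = 2.5;  p75 = 3.5;                   % placeholder quartile values
mu_hat    = (p25 + p75)/2;               % mean is midway between the quartiles
sigma_hat = (p75 - p25)/1.35;            % IQR of a normal is ~1.35*sigma
p90 = norminv(0.90, mu_hat, sigma_hat);  % e.g. the 90th percentile
```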
If you want to go with a linear interpolation, have a look at interp1 (https://mathworks.com/help/matlab/ref/interp1.html).
What test can I do in MATLAB to test the spread of a histogram? For example, in the given set of histograms, I am only interested in 1,2,3,5 and 7 (going from left to right, top to bottom) because they are less spread out. How can I obtain a value that will tell me if a histogram is positively skewed?
It may be possible using Chi-Squared tests but I am not sure what the MATLAB code will be for that.
You can use the standard definition of skewness, i.e. the third standardized moment:
skew = [ (1/N) * sum((x_i - mu)^3) ] / [ (1/(N-1)) * sum((x_i - mu)^2) ]^(3/2)
You compute the mean of your data and use the above equation to calculate the skewness. Positive skewness means the distribution has a longer right tail; negative skewness means a longer left tail (source: Wikipedia).
As such, the larger the value, the more positively skewed the data; the more negative the value, the more negatively skewed it is.
Now, computing the mean of your histogram data is quite simple: take the sum of the bin centres weighted by the histogram counts, and divide by the total number of entries. Given that your histogram counts are stored in h and the bin centres in x, you would do the following. I will assume here that you have bins from 0 up to N-1, where N is the total number of bins in the histogram, judging from your picture:
x = 0:numel(h)-1; %// bin centres, judging from your pictures
num_entries = sum(h(:));
mu = sum(h.*x) / num_entries;
skew = ((1/num_entries)*sum(h.*(x - mu).^3)) / ...
       ((1/(num_entries-1))*sum(h.*(x - mu).^2))^(3/2);
skew will contain the numerical measure of skewness for a histogram according to that formula. Therefore, for your problem statement, you want to look for skewness values that are positive and large. I can't really say what threshold you should use, but look for positive values that are much larger than those of most of your histograms.
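To sanity-check the measure, you can try it on synthetic data that is known to be right-skewed, such as an exponential sample (this sketch assumes the Statistics Toolbox for exprnd, and histcounts from R2014b onward):

```matlab
% An exponential sample is strongly right-skewed, so the computed skew
% should come out clearly positive.
data = exprnd(1, 10000, 1);
[h, edges] = histcounts(data, 30);
x  = (edges(1:end-1) + edges(2:end))/2;   % bin centres
n  = sum(h);
mu = sum(h.*x)/n;
skew = (sum(h.*(x - mu).^3)/n) / (sum(h.*(x - mu).^2)/n)^(3/2);
```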
I'd like to make an array of random samples from a Gaussian distribution.
The mean value is 0 and the variance is 1.
If I take enough samples, I would think the maximum value of a sample would be 0+1=1.
However, I find that I get values like 4.2891 ...
My code:
x = 0+sqrt(1)*randn(100000,1);
mean(x)
var(x)
max(x)
This gives me a mean of about 0 and a variance of 0.9937, but my maximum value is 4.2891?
Can anyone help me out why it does this?
As others have mentioned, there is no bound on the possible values that x can take on in a gaussian distribution. However, the farther x is from the mean, the less likely it is to be drawn.
To give you some intuition for what the variance actually means (for any distribution, not just the gaussian case), you can look at the 68-95-99.7 rule. The rule says:
about 68% of the population will be within one sigma of the mean
about 95% of the population will be within two sigmas of the mean
about 99.7% of the population will be within three sigmas of the mean
Here sigma = sqrt(var) is the standard deviation of the distribution.
So while in theory it is possible to draw any x from a Gaussian distribution, in practice you are unlikely to draw anything more than 5 or 6 standard deviations from the mean in a sample of 100000.
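You can verify the rule empirically with randn; a quick sketch:

```matlab
% Empirical check of the 68-95-99.7 rule with standard-normal draws:
% count the fraction of samples within 1, 2, and 3 sigmas of the mean.
x = randn(100000, 1);
within1 = mean(abs(x) < 1);   % should be close to 0.68
within2 = mean(abs(x) < 2);   % should be close to 0.95
within3 = mean(abs(x) < 3);   % should be close to 0.997
fprintf('%.3f %.3f %.3f\n', within1, within2, within3);
```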
This will yield N random numbers drawn from a Gaussian (normal) distribution. Note the [N, 1] size argument: normrnd(mu, sigma, N) would return an N-by-N matrix instead of N numbers.
N = 100;
mu = 0;
sigma = 1;
Xs = normrnd(mu, sigma, [N, 1]);
EDIT:
I just realized that your code is in fact equivalent to what I've written.
As others already pointed out: the variance is not the maximum distance a sample can deviate from the mean! It is just the average of the squared distances from the mean.
Even with a classifier as simple as nearest neighbour, I cannot seem to judge its accuracy and thus cannot improve it.
For example with the code below:
IDX = knnsearch(train_image_feats, test_image_feats);
predicted_categories = cell([size(test_image_feats, 1), 1]);
for i = 1:size(IDX, 1)
    predicted_categories{i} = train_labels(IDX(i));
end
Here train_image_feats is a 300 by 256 matrix where each row represents an image; test_image_feats has the same structure. train_labels holds the label corresponding to each row of the training matrix.
The book I am following simply said that the above method achieves an accuracy of 19%.
How did the author come to this conclusion? Is there any way to judge the accuracy of my results, be it with this classifier or any other?
The author then uses another method of feature extraction and says it improved accuracy by 30%.
How can I find the accuracy? Be it graphically or just via a simple percentage.
Accuracy in machine learning classification is usually calculated by comparing your classifier's predicted outputs to the ground truth. By the time you evaluate classification accuracy, you will already have built a predictive model using a training set with known inputs and outputs. You will also have a test set with inputs and outputs that were not used to train the classifier; for the purposes of this post, let's call this the ground truth data set. This ground truth data set helps assess the accuracy of your classifier on inputs it has not seen before. You take the inputs from your test set and run them through your classifier; the outputs you get for each input are collectively called the predicted values.
For each predicted value, you compare to the associated ground truth value and see if it is the same. You add up all of the instances where the outputs match up between the predicted and the ground truth. Adding all of these values up, and dividing by the total number of points in your test set yields the fraction of instances where your model accurately predicted the result in comparison to the ground truth.
In MATLAB, this is really simple to calculate. Supposing that your categories for your model were enumerated from 1 to N where N is the total number of labels you are classifying with. Let groundTruth be your vector of labels that denote the ground truth while predictedLabels denote your labels that are generated from your classifier. The accuracy is simply calculated by:
accuracy = sum(groundTruth == predictedLabels) / numel(groundTruth);
accuracyPercentage = 100*accuracy;
The first line of code calculates the accuracy of your model as a fraction. The second line expresses it as a percentage by multiplying the first line by 100. You can use either one when you want to assess accuracy: one is normalized to [0,1] while the other is a percentage from 0% to 100%. What groundTruth == predictedLabels does is compare each element of groundTruth with the corresponding element of predictedLabels: if the ith values match, it outputs a 1, and if not, a 0. This gives a vector of 0s and 1s, so we simply add up all of the 1s, which is eloquently encapsulated in the sum operation, and divide by the total number of points in our test set to obtain the final accuracy of the classifier.
With a toy example, supposing I had 4 labels, and my groundTruth and predictedLabels vectors were this:
groundTruth = [1 2 3 2 3 4 1 1 2 3 3 4 1 2 3];
predictedLabels = [1 2 2 4 4 4 1 2 3 3 4 1 2 3 3];
The accuracy using the above vectors gives us:
>> accuracy
accuracy =
0.4000
>> accuracyPercentage
accuracyPercentage =
40
This means that we have a 40% accuracy or an accuracy of 0.40. Using this example, the predictive model was only able to accurately classify 40% of the test set when you put each test set input through the classifier. This makes sense, because between our predicted outputs and ground truth, only 40%, or 6 outputs match up. These are the 1st, 2nd, 6th, 7th, 10th and 15th elements. There are other metrics to calculating accuracy, like ROC curves, but when calculating accuracy in machine learning, this is what is usually done.
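If you want more detail than a single accuracy number, a confusion matrix shows which classes get mistaken for which. A sketch using accumarray, assuming integer labels 1..N and reusing the toy vectors above:

```matlab
% Confusion matrix: entry (i,j) counts how often ground-truth class i
% was predicted as class j; the diagonal holds the correct predictions.
groundTruth     = [1 2 3 2 3 4 1 1 2 3 3 4 1 2 3];
predictedLabels = [1 2 2 4 4 4 1 2 3 3 4 1 2 3 3];
N = 4;                                  % number of classes
C = accumarray([groundTruth(:) predictedLabels(:)], 1, [N N]);
accuracy = sum(diag(C)) / sum(C(:));    % same 0.40 as above
```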
I have two 3x3 matrices whose elements follow Gaussian distributions. Now, I want to estimate upper and lower thresholds for these matrices. I denote the mean and standard deviation of the two matrices by mu1, mu2 and sigma1, sigma2. The high and low thresholds are
T_high = (mu1+mu2)/2 + k1*(sigma1+sigma2)/2
T_low  = (mu1+mu2)/2 - k2*(sigma1+sigma2)/2
where k1 and k2 are constants.
My question is: is my formula correct? Because this is a Gaussian distribution, k1 = k2, right? This is my code; could you check it for me?
mu1 = mean(v1(:));     % first matrix
sigma1 = std(v1(:));
mu2 = mean(v2(:));     % second matrix
sigma2 = std(v2(:));
k1 = 1;
k2 = 1;
T_high = (mu1+mu2)/2 + k1*(sigma1+sigma2)/2;
T_low  = (mu1+mu2)/2 - k2*(sigma1+sigma2)/2;
In the formula you are using, the joint standard deviation is wrong. It should be
T_high = (mu1+mu2)/2 + k1*sqrt((sigma1^2+sigma2^2)/2);
T_low  = (mu1+mu2)/2 - k2*sqrt((sigma1^2+sigma2^2)/2);
Since you treat all 18 pixels as belonging to the same distribution, why not use the following:
v = [v1(:); v2(:)];
mu = mean(v);
sigma = std(v);
k1 = 1;
k2 = 1;
T_high = mu + k1*sigma;
T_low  = mu - k2*sigma;