I am trying to implement a Naive Bayes classifier and am really confused by the problem of Laplace smoothing.
The probability of a word Wi occurring in class C is:
<pre>
P(Wi|C) = (count(Wi,C) + 1) / (count(all, C) + |V|)
</pre>
But what is V? Is it the vocabulary of only the training corpus, or is V the whole English vocabulary?
It should be the vocabulary of the training corpus.
Laplace smoothing in Naive Bayes is used to manage the bias-variance trade-off, i.e. the overfitting vs. underfitting problem.
It adds a hyperparameter (alpha) to both the numerator and the denominator of your formula. You have to tune this parameter, e.g. with grid search or random search, to choose a better model. https://towardsdatascience.com/hyperparameter-tuning-c5619e7e6624
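For concreteness, here is a minimal MATLAB/Octave sketch of the smoothed estimate. The variable wordCounts is a hypothetical 1-by-|V| vector of counts of each training-vocabulary word within class C, and alpha is the smoothing hyperparameter (alpha = 1 gives the add-one formula above):
alpha = 1;                                              % Laplace / add-one smoothing
V = numel(wordCounts);                                  % vocabulary size (training corpus only)
P_w_given_C = (wordCounts + alpha) / (sum(wordCounts) + alpha*V);
% every word of the training vocabulary now gets a non-zero probability,
% even if it never co-occurred with class C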
I am reading a statistics textbook, Introduction to Statistics for Engineers by Sheldon Ross (p. 275), and trying to re-do its examples on paper and in Octave. I am not able to replicate many of the Bayes calculations in Octave when it comes to the integration part. Please advise how to go about replicating the calculation below in Octave. It is a simple Bayes estimator example which naturally becomes a symbolic integration problem, something I often have difficulty doing in Octave.
[Clarification: This calculation is from a textbook and I understand it by hand. What I don't understand is how one should approach such statistical computing exercises in practice. This question relates to statistical/scientific computing, not coding or statistics per se.]
Suppose X1, ..., Xn are independent Bernoulli random variables, having pdf
f(xi | p) = p^xi * (1-p)^(1-xi),  xi = 0, 1,
where p is the unknown parameter, assumed a priori uniform on (0, 1).
Compute the Bayes estimator for p.
We know that
f(x1, ..., xn | p) = p^(Σ xi) * (1-p)^(n - Σ xi).
The conditional pdf of p given X1 = x1, ..., Xn = xn is then
f(p | x1, ..., xn) = p^(Σ xi) (1-p)^(n - Σ xi) / ∫_0^1 p^(Σ xi) (1-p)^(n - Σ xi) dp.
It can be shown that
∫_0^1 p^a (1-p)^b dp = a! b! / (a+b+1)!   ---(1)
Using (1) and letting x = Σ xi, the conditional pdf becomes
f(p | x1, ..., xn) = (n+1)! / (x! (n-x)!) * p^x (1-p)^(n-x).
Recall that the Bayes estimator is E[p | X1, ..., Xn].
Therefore, the Bayes estimator for p is
E[p | X1, ..., Xn] = ∫_0^1 p * f(p | x1, ..., xn) dp = (x+1) / (n+2).
Now, I try to replicate these steps using Octave as below and fail (the integration took 40 minutes on my $2500 Dell desktop). Can you show my confused soul how you would do the above steps in Octave, Matlab, or R to arrive at the same Bayes estimator?
#Use Octave to derive the above Bayes estimator
pkg load symbolic;
syms p n x;
f = (p^x) * (1-p)^(n-x);
F = int(f, p, [0, 1]); #integrate f, which gives the conditional pdf denominator
f_conditional = f/F; #the conditional pdf
integrand = p * f_conditional; # the integrand to derive Bayes estimator
estimator = int(integrand, p, [0, 1]);
#this integration takes forever, how else should I replicate the above in Octave?
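Not an authoritative answer, but one observation: integrating with fully symbolic exponents forces SymPy to reason about arbitrary (even complex) n and x, which is what makes it so slow. If the goal is just to check the hand-derived result, substituting concrete values first makes the same script finish quickly; n_val and x_val below are placeholder values chosen for this sketch:
pkg load symbolic;
syms p
n_val = 10;  x_val = 7;                        % hypothetical sample: 7 successes in 10 trials
f = p^x_val * (1 - p)^(n_val - x_val);         % p^x * (1-p)^(n-x) with the numbers plugged in
F = int(f, p, 0, 1);                           % normalizing constant of the conditional pdf
estimator = int(p*f/F, p, 0, 1);               % posterior mean = Bayes estimator
double(estimator)                              % 0.6667, i.e. (x+1)/(n+2) = 8/12
If you do want the fully symbolic answer, telling the symbolic package that n and x are non-negative integers (via assumptions on the symbols) usually lets the integral reduce to identity (1) instead of a huge piecewise expression.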
I took the MATLAB code from this tutorial: Texture Segmentation Using Gabor Filters.
To test clustering algorithms on the resulting multi-dimensional texture responses to the Gabor filters, I applied Gaussian Mixture Models and Fuzzy C-means instead of K-means and compared their results (number of clusters = 2 in all cases):
Original image:
K-means clusters:
L = kmeans(X, 2, 'Replicates', 5);
GMM clusters:
options = statset('MaxIter',1000);
gmm = fitgmdist(X, 2, 'Options', options);
L = cluster(gmm, X);
Fuzzy C-means:
[centers, U] = fcm(X, 2);
[values indexes] = max(U);
What I've found weird in this case is that K-means clusters are more accurate than those extracted using GMM and Fuzzy C-means.
Can anyone explain whether the high dimensionality (L x W x 26, where 26 is the number of Gabor filters used) of the data given as input to the GMM and Fuzzy C-means algorithms is what's causing the clustering to be less accurate?
In other words is the GMM and the Fuzzy C-means clustering more sensitive to the dimensionality of the data, than K-means is?
Glad the comment was useful; here are my observations in answer form.
Each of these methods is sensitive to initialization, but k-means is cheating by using 5 'Replicates' and a higher-quality initialization (k-means++). The other methods appear to be using a single random initialization.
k-means is a GMM if you force spherical covariances, so in theory it shouldn't do much better (it might do slightly better if the true covariances were in fact spherical).
I think most of the discrepancy comes down to initialization. You should be able to test this by using the k-means result as initial conditions for the other algorithms (see the sketch below). Or, as you tried, run several times using different random seeds and check whether there is more variation in GMM and Fuzzy C-means than there is in k-means.
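A hedged sketch of that initialization test for the GMM (MATLAB, Statistics and Machine Learning Toolbox), seeding fitgmdist with the k-means labels so both methods start from the same partition; X is the same feature matrix as in the question:
L0 = kmeans(X, 2, 'Replicates', 5);                       % k-means labels (k-means++ init)
options = statset('MaxIter', 1000);
gmm = fitgmdist(X, 2, 'Start', L0, 'Options', options);   % 'Start' accepts a label vector
L_gmm = cluster(gmm, X);
If the GMM segmentation now matches the k-means one much more closely, the earlier gap was mostly an initialization effect rather than a dimensionality effect.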
How could I decompose a two-peaked (empirical) pdf into, say, two lognormals or other appropriate component pdfs in a straightforward way, i.e. go from the empirical density to a fitted two-component mixture? I'd prefer to do it in Matlab.
Thanks!
What you are looking for is called a mixture density, defined as p(x) = sum_i a_i p_i(x), where sum_i a_i = 1 and each p_i(x) is itself a density function. The most widely used such model is the Gaussian mixture density, in which each p_i(x) is a Gaussian density; Matlab's Statistics toolbox has functions to fit one (e.g. fitgmdist). More generally, the p_i(x) can be any density. The customary algorithm for fitting the parameters is the expectation-maximization (EM) algorithm. A web search should turn up a lot of references and probably some Matlab code for that as well.
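As a minimal sketch (MATLAB, Statistics and Machine Learning Toolbox), assuming your observations are in a column vector named data, a two-component Gaussian mixture can be fitted by EM and overlaid on the empirical density like this:
gm = fitgmdist(data, 2);                          % EM fit of a1*N(mu1,s1) + a2*N(mu2,s2)
xs = linspace(min(data), max(data), 500)';
histogram(data, 'Normalization', 'pdf'); hold on;
plot(xs, pdf(gm, xs), 'LineWidth', 2); hold off;  % fitted mixture density
% gm.mu, gm.Sigma and gm.ComponentProportion hold the fitted means, variances and weights
If you specifically want lognormal components, a common trick is to fit the Gaussian mixture to log(data) instead, since a lognormal in x is a Gaussian in log(x).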
I want to train an SVM with a non-linear boundary. The boundary is known and is expressed by the formula
y = sgn((w11*x1 + w12*x2 + w13*x3) * (w21*x4 + w22*x5 + w23*x6)), where [x1 x2 ... x6] are 1-bit inputs and [w11 w12 w13 w21 w22 w23] are unknown parameters.
How can I learn [w11 w12 w13 w21 w22 w23] from the training data?
An SVM is not an algorithm for such a task. The SVM has its own criterion to maximize, which has nothing to do with the shape of the decision boundary (ok, not nothing, but it is hard to convert one into the other). Obviously, one can try to define a custom kernel function to achieve this, but it seems like an almost unsolvable problem (I can't think of any reproducing kernel Hilbert space with such decision boundaries).
In short: your question is a bit like asking "how do I make a watermelon remove nails from the wall?". Obviously, you can do some pretty hard "magic" to make it happen, but that is not what watermelons are for.
I have a binary classification problem that I need to do in MATLAB. There are two classes, and the training and testing data are 2D coordinates drawn from Gaussian distributions.
The samples are 2D points and they are something like these (1000 samples for class A and 1000 samples for class B):
I am just posting some of them here:
5.867766 3.843014
5.019520 2.874257
1.787476 4.483156
4.494783 3.551501
1.212243 5.949315
2.216728 4.126151
2.864502 3.139245
1.532942 6.669650
6.569531 5.032038
2.552391 5.753817
2.610070 4.251235
1.943493 4.326230
1.617939 4.948345
If a new test sample comes in, how should I classify it?
P(Class|TestPoint) is proportional to P(TestPoint|Class) * P(Class).
I am not sure how to compute P(TestPoint|Class) for the 2D coordinates given. Right now, I am using the formula
P(Coordinates|Class) = (Coordinates - mean for that class) / (standard deviation of points in that class).
However, I am not getting very good test results with this. Am I doing anything wrong?
That is the right method; however, the formula is not correct. Look at the multivariate Gaussian distribution article on Wikipedia:
P(TestPoint|Class) = 1 / ((2*pi)^(d/2) * |Sigma|^(1/2)) * exp(-1/2 * (x - mu)' * inv(Sigma) * (x - mu)),
where |A| is the determinant of A, mu is the mean of the class, Sigma is its covariance matrix, and d is the dimension (d = 2 here).
mu = mean(classPoint, 2);                        % class mean; classPoint is 2-by-N, one point per column
Sigma = cov(classPoint');                        % 2-by-2 sample covariance (observations as rows)
d = testPoint - mu;                              % testPoint is a 2-by-1 column vector
proba = 1/((2*pi)^(2/2)*det(Sigma)^(1/2)) * ...
        exp(-1/2 * d' * (Sigma\d));              % multivariate normal density at testPoint
In your case, since there are as many points in both classes, P(Class) = 1/2.
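Putting the pieces together, a minimal sketch (MATLAB, Statistics and Machine Learning Toolbox, which provides mvnpdf); here classA and classB are assumed to be N-by-2 matrices with one training point per row, and testPoint a 1-by-2 row vector:
muA = mean(classA);  SigmaA = cov(classA);     % per-class mean (1-by-2) and covariance (2-by-2)
muB = mean(classB);  SigmaB = cov(classB);
pA = mvnpdf(testPoint, muA, SigmaA) * 0.5;     % P(TestPoint|A) * P(A)
pB = mvnpdf(testPoint, muB, SigmaB) * 0.5;     % P(TestPoint|B) * P(B)
if pA > pB
    label = 'A';
else
    label = 'B';
end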
Assuming your formula is correctly applied, another issue could be the derivation of features from your data points. Your problem might not be suited for a linear classifier.