I am working on the problem of handwritten image recognition. For this, I use support vector machines (SVM) as the classifier. The matrix score below shows an example of the scores returned by the SVM for 5 samples; the number of classes is also 5. I want to transform this matrix into probabilities.
score = [ 0.2590  -0.6033  -1.1350  -1.2347  -0.9776
         -1.4727  -0.2136  -0.9649   0.1480  -1.4761
         -0.9637  -0.8662   0.0674  -1.0051  -1.1293
         -2.1230  -0.8805  -0.9808  -0.0520  -0.0836
         -1.6976  -1.1578  -0.9205  -1.1101   1.0796 ]
After researching existing methods, I found that Platt scaling seems the most appropriate for my case. I found an implementation of this method at this link: Platt scaling. The problem is that I don't understand the third parameter it expects. Please help me understand this implementation and make it executable.
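From what I understand so far, Platt scaling fits a sigmoid P(y = 1 | s) = 1 / (1 + exp(A*s + B)) to the decision scores s of each class (one-vs-rest), and the fitted sigmoid then maps a score to a probability. Here is a rough Python sketch of that idea as I understand it (the scores/labels arrays, the use of scipy, and the per-class normalisation are my own placeholders, not the linked implementation):

import numpy as np
from scipy.optimize import minimize

def fit_platt(scores, labels):
    # Fit A, B so that P(y = 1 | s) = 1 / (1 + exp(A*s + B)).
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    # Platt's smoothed targets instead of hard 0/1 labels
    t = np.where(labels == 1, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))

    def nll(ab):
        A, B = ab
        p = 1.0 / (1.0 + np.exp(A * scores + B))
        eps = 1e-12
        return -np.sum(t * np.log(p + eps) + (1.0 - t) * np.log(1.0 - p + eps))

    x0 = np.array([0.0, np.log((n_neg + 1.0) / (n_pos + 1.0))])
    return minimize(nll, x0).x          # returns [A, B]

For the multi-class case I would fit one (A, B) pair per class on held-out scores and then normalise the per-class probabilities of each sample so they sum to 1.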
I await your answers and thank you in advance
I'm familiarizing myself with PySpark and Spark ML at the moment. To do so, I use the Titanic dataset to train a GLM for predicting the 'Fare' column in that dataset.
I'm closely following the Spark documentation. I do get a working model (which I call glm_fare), but when I try to assess the trained model using summary, I get the following error message:
RuntimeError: No training summary available for this GeneralizedLinearRegressionModel
Why is this?
The code for training was as such:
from pyspark.ml.regression import GeneralizedLinearRegression

glm_fare = GeneralizedLinearRegression(
    labelCol="Fare",
    featuresCol="features",
    predictionCol="prediction",
    family="gamma",
    link="log",
    weightCol="wght",
    maxIter=20
)
glm_fit = glm_fare.fit(training_df)
glm_fit.summary
Just in case someone comes across this question, I ran into this problem as well and it seems that this error occurs when the Hessian matrix is not invertible. This matrix is used in the maximization of the likelihood for estimating the coefficients.
The matrix is not invertible if one of the eigenvalues is 0, which occurs when there is multicollinearity in your variables. This means that one of the variables can be predicted with a linear combination of the other variables. Consequently, the effect of each of the variables cannot be identified with any significance.
A possible solution would be to find the variables that are (multi)collinear and remove one of them from the regression. Note however that multicollinearity is only a problem if you want to interpret the coefficients and not when the model is used for prediction.
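For what it's worth, one quick way to look for (near-)collinear columns before fitting is the correlation matrix of the assembled feature vector. A sketch, assuming the same "features" column and training_df as above (note that pairwise correlation only catches pairwise collinearity, not general linear combinations):

import numpy as np
from pyspark.ml.stat import Correlation

# Pearson correlation matrix of the "features" vector column.
corr = Correlation.corr(training_df, "features").head()[0].toArray()

# Feature index pairs whose absolute correlation is suspiciously high.
i, j = np.where(np.triu(np.abs(corr) > 0.95, k=1))
print(list(zip(i.tolist(), j.tolist())))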
It is documented in the GeneralizedLinearRegressionModel docs that a summary may not be available for a model.
However, you can do an initial check to avoid the error: glm_fit.hasSummary(), which is a public boolean method. Use it as:

if glm_fit.hasSummary():
    print(glm_fit.summary)
Here is a direct link to the PySpark source code, and to the GeneralizedLinearRegressionTrainingSummary class source code where the error is thrown.
Make sure the input variables for your one-hot encoder start from 0.
One mistake I made that caused the summary not to be created was feeding quarter values (1, 2, 3, 4) directly into the one-hot encoder; I got a vector of length 4 in which one column was always 0, which makes the design matrix rank-deficient. I converted the quarters to 0, 1, 2, 3 and the problem was solved.
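A sketch of what I mean, using StringIndexer to produce 0-based indices before encoding (column names are just examples, and this is the Spark 3.x OneHotEncoder API):

from pyspark.ml.feature import StringIndexer, OneHotEncoder

# StringIndexer maps category values to 0-based indices, which is what
# OneHotEncoder expects as input.
indexer = StringIndexer(inputCol="quarter", outputCol="quarter_idx")
encoder = OneHotEncoder(inputCols=["quarter_idx"], outputCols=["quarter_vec"])

indexed = indexer.fit(training_df).transform(training_df)
encoded = encoder.fit(indexed).transform(indexed)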
The second output of the libsvmread command is a set of features for each given training example.
For example, in the following MATLAB command:
[heart_scale_label, heart_scale_inst] = libsvmread('../heart_scale');
This second variable (heart_scale_inst) holds content in a form that I don't understand, for example:
<1, 1> -> 0.70833
What does it mean, and how is it to be used? (I can't plot it the way it is.)
PS. If anyone could please recommend a good LIBSVM tutorial, I'd appreciate it. I haven't found anything useful and the README file isn't very clear... Thanks.
The definitive beginner's tutorial for LIBSVM is called A Practical Guide to Support Vector Classification; it is available from the site of the LIBSVM authors.
The second output returned is called the instance matrix. It is a matrix; let's call it M. M(1,:) holds the features of data point 1, and so on. The matrix is sparse, which is why it prints out oddly: an entry such as <1, 1> -> 0.70833 simply means M(1,1) = 0.70833. If you want to see it in full, print full(M).
[heart_scale_label, heart_scale_inst] = libsvmread('../heart_scale');
With heart_scale_label and heart_scale_inst you should be able to train an SVM by issuing:
mod = svmtrain(heart_scale_label,heart_scale_inst,'-c 1 -t 0');
I strongly suggest you read the guide linked above to learn how to set the C parameter (and, in the case of an RBF kernel, the gamma parameter), but the line above is how you would train with that data.
I think it is the probability with which the test case has been predicted to belong to the heart_scale label category.
Recently I have been training an HMM using the HMM toolbox, but I have some problems I couldn't resolve.
I train my HMM as shown below. There are no problems here.
[LL, prior1, transmatrix1, observematrix1] = dhmm_em(data, prior0, transmatrix0, observematrix0);
I use the Viterbi algorithm to find the most-probable path through the HMM state trellis.
function path = viterbi_path(prior, transmat, obslik);
Now there's a problem: I don't know what "obslik" means. Is it observematrix1?
I want to get the probability of a sequence, but I don't know whether I should use the "fwdback" function or not. If I should, what does "obslik" mean there?
function [alpha, beta, gamma, loglik, xi_summed, gamma2] = fwdback(init_state_distrib, transmat, obslik, varargin);
Thanks!!!
At first I didn't understand the comments; now I do.
The "obslik" here is not the same as observematrix1. Before using the viterbi_path function, we should compute obslik:
obslik = multinomial_prob(data(m,:), observematrix1);
Here the data matrix is observematrix0, the observation matrix before training.
Am I right?
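For reference, here is a rough NumPy sketch of what I understand obslik to be (assuming 0-based observation symbol indices, whereas the MATLAB toolbox uses 1-based ones; the matrices are made-up examples):

import numpy as np

# Example observation matrix: obsmat[q, o] = P(symbol o | hidden state q)
obsmat = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])     # 2 states, 3 symbols

# Example observed sequence of symbol indices (0-based here)
data = np.array([0, 2, 2, 1])

# obslik[q, t] = P(data[t] | hidden state q), which is what the Viterbi and
# forward-backward routines expect
obslik = obsmat[:, data]                 # shape (2, 4)
print(obslik)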
I have this photo:
I'm trying to perform document binarization using the Niblack algorithm.
I've implemented the basic Niblack algorithm,
T = mean + K * standardDeviation
and this was its result:
The problem is that in some parts of the image the window doesn't contain any objects, so the algorithm detects the noise there as objects and enhances it.
I tried applying a blurring filter followed by global thresholding,
and that was the result:
which can't be fixed by any other filter.
I guess the only solution is to prevent the algorithm from picking up noise when the window is free of objects.
I'm interested in doing this with the Niblack algorithm, not with another algorithm, so any suggestions?
I tried the Sauvola algorithm from the paper "Adaptive document image binarization" by J. Sauvola and M. Pietikäinen, section 3.3.
It is a modified version of the Niblack algorithm that uses a modified threshold equation; if I read the paper correctly, the threshold becomes T = m * (1 + k * (s/R - 1)), where m and s are the local mean and standard deviation and R is the dynamic range of the standard deviation (128 for 8-bit images).
It returned pretty good results:
I also tried another modification of Niblack, which is described in this paper
in section 5.5, Algorithm No. 9a: Université de Lyon, INSA, France (C. Wolf, J.-M. Jolion),
which returned good results as well:
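For reference, here is a rough NumPy sketch of the Sauvola-style thresholding described above (the window size w, k, and R = 128 are just typical values, and scipy's uniform_filter stands in for the paper's implementation):

import numpy as np
from scipy.ndimage import uniform_filter

def sauvola_binarize(img, w=25, k=0.5, R=128.0):
    # Binarize a grey-scale image with the threshold T = m * (1 + k * (s/R - 1)).
    img = np.asarray(img, dtype=np.float64)
    m = uniform_filter(img, size=w)                                           # local mean
    s = np.sqrt(np.maximum(uniform_filter(img ** 2, size=w) - m ** 2, 0.0))   # local std
    T = m * (1.0 + k * (s / R - 1.0))
    return img > T          # True for (light) background pixels, False for dark text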
Did you look here: https://stackoverflow.com/a/9891678/105037
filt = fspecial('average', 15);               % example averaging kernel (window size assumed, tune as needed)
local_mean = imfilter(X, filt, 'symmetric');
local_var  = imfilter(X .^ 2, filt, 'symmetric') - local_mean .^ 2;   % E[X^2] - E[X]^2
local_std  = sqrt(max(local_var, 0));
X_bin = X >= (local_mean + k_threshold * local_std);
I don't see many options here if you insist on using Niblack. You can change the size and type of the filter, and the threshold.
BTW, it seems that your original image has colors. This information can significantly improve black text detection.
There are a range of methods that can help in this situation:
Of course, you can change the algorithm itself =)
It is also possible to simply apply morphology filters: first apply a maximum over the window, and then a minimum (see the sketch after this list). You should tune the window size to achieve a better result; see the wiki.
You can also choose the hardest but better way and try to improve Niblack's scheme itself: increase Niblack's window size when the standard deviation is smaller than some fixed number (which should be tuned).
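A sketch of the morphology idea in Python (scipy; the window size is something to tune):

import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter

def close_dark_noise(img, w=15):
    # Maximum over a w-by-w window followed by a minimum over the same window
    # is a grey-scale closing: it removes dark specks smaller than the window.
    img = np.asarray(img, dtype=np.float64)
    return minimum_filter(maximum_filter(img, size=w), size=w)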
I tried the Niblack algorithm with k = -0.99 and window = 990, using the optimisation from Shafait, "Efficient Implementation of Local Adaptive Thresholding Techniques Using Integral Images", 2008,
with T = mean + K * standardDeviation; I got this result:
The implementation of the algorithm is taken from here.
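For reference, a rough NumPy sketch of the integral-image trick (my own illustration of the idea in the Shafait paper, not the linked implementation):

import numpy as np

def niblack_integral(img, w=990, k=-0.99):
    # Niblack thresholding T = mean + k * std, with local mean/std computed
    # from integral images so the cost is independent of the window size.
    img = np.asarray(img, dtype=np.float64)
    H, W = img.shape

    # Zero-padded integral images of the image and of its square.
    S = np.zeros((H + 1, W + 1));  S[1:, 1:]  = np.cumsum(np.cumsum(img, 0), 1)
    S2 = np.zeros((H + 1, W + 1)); S2[1:, 1:] = np.cumsum(np.cumsum(img ** 2, 0), 1)

    # Window bounds around each pixel, clamped at the image borders.
    h = w // 2
    r0 = np.clip(np.arange(H) - h, 0, H); r1 = np.clip(np.arange(H) + h + 1, 0, H)
    c0 = np.clip(np.arange(W) - h, 0, W); c1 = np.clip(np.arange(W) + h + 1, 0, W)

    def window_sum(I):
        # Sum over every window at once via the four-corner trick.
        return (I[np.ix_(r1, c1)] - I[np.ix_(r0, c1)]
                - I[np.ix_(r1, c0)] + I[np.ix_(r0, c0)])

    area = (r1 - r0)[:, None] * (c1 - c0)[None, :]
    mean = window_sum(S) / area
    var = window_sum(S2) / area - mean ** 2
    T = mean + k * np.sqrt(np.maximum(var, 0.0))
    return img > T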
I need to calculate the Tanimoto coefficient. I don't know what's wrong with my code: I have two nearly similar images, but the value my code returns indicates that the two images are highly dissimilar. Kindly help me with my code.
%Tanimoto coeff
I=imread('sliver3.jpg');
J=imread('ref5.jpg');
figure,imshow(I),title('Original');
figure,imshow(J),title('Reference');
inter=intersect(I,J,'rows');
uni=union(I,J,'rows');
si=size(inter);
su=size(uni);
tc=si/su
I attach three images here. The first one is the segmented output. The second is a reference image. The third is also a reference, but one that is highly dissimilar. So the first and second should come out as nearly similar, and the first and third as highly dissimilar, but I'm getting the opposite.
For the first two images, tc = 0.4895
For the first and third images, tc = 0.5692
Kindly help me out.
I think you should use the sum() function on the union and the intersection instead of size(), since the Tanimoto coefficient is the sum of the intersection divided by the sum of the union.
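A sketch of that idea in Python, treating the images as binary masks (the threshold used for binarization is just a placeholder):

import numpy as np

def tanimoto(img1, img2, thresh=128):
    # Tanimoto/Jaccard coefficient of two images treated as binary masks.
    A = np.asarray(img1) >= thresh
    B = np.asarray(img2) >= thresh
    inter = np.logical_and(A, B).sum()
    union = np.logical_or(A, B).sum()
    return inter / union if union else 1.0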