How does Orange calculate confidence intervals in its Distribution widget? - orange

When using Orange's Distribution widget on a binary classification dataset, there is an option to show confidence intervals for the probability of a given class label across all feature values, see: Distribution Widget Doc
How are these intervals calculated? I've tried searching the GitHub repo using the keywords 'distribution' and 'confidence interval', but have only found the code for the widget UI and no pointers to where the actual statistics are calculated.

It's done in the calcHistogramAndProbGraph method of OWDistributions.py (code), which is the code for the distributions widget.
For discrete features it's simply the observed ratio. For continuous features it calls out to C++ code that (I assume) discretizes the feature and estimates the probability in a similar fashion.
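For the discrete case, a minimal sketch of what a confidence interval around such an observed ratio can look like is below. This assumes a normal-approximation (Wald) binomial interval; the exact formula Orange applies lives in the widget/C++ code, so treat the choice of interval here as an assumption, not Orange's actual implementation.

import numpy as np
from scipy import stats

def proportion_ci(k, n, alpha=0.05):
    # Normal-approximation (Wald) interval for a class probability estimated
    # as the observed ratio k/n; illustrative only, not necessarily the exact
    # formula used inside Orange.
    p = k / n
    z = stats.norm.ppf(1 - alpha / 2)
    half_width = z * np.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

print(proportion_ci(30, 100))   # e.g. 30 rows of the target class out of 100 with this feature value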


Episodic Semi-gradient Sarsa with Neural Network

While trying to implement Episodic Semi-gradient Sarsa with a neural network as the approximator, I wondered how to choose the optimal action based on the currently learned weights of the network. If the action space is discrete I can just calculate the estimated value of the different actions in the current state and choose the one which gives the maximum. But this seems not to be the best way of solving the problem. Furthermore, it does not work if the action space can be continuous (like the acceleration of a self-driving car, for example).
So, basically, I am wondering how to solve the 10th line, "Choose A' as a function of q(S', ·, w)", in this pseudo-code of Sutton:
How are these problems typically solved? Can one recommend a good example of this algorithm using Keras?
Edit: Do I need to modify the pseudo-code when using a network as the approximator? So that I simply minimize the MSE between the network's prediction and the reward R, for example?
I wondered how I choose the optimal action based on the currently learned weights of the network
You have three basic choices:
Run the network multiple times, once for each possible value of A' to go with the S' value that you are considering. Take the maximum value as the predicted optimum action (with probability 1-ε; otherwise choose randomly, for the ε-greedy policy typically used in SARSA).
Design the network to estimate all action values at once, i.e. to have |A(s)| outputs (perhaps padded to cover "impossible" actions that you need to filter out). This alters the gradient calculations slightly: zero gradient should be applied to the inactive outputs of the last layer (i.e. anything not matching the A of (S,A)). Again, just take the maximum valid output as the estimated optimum action. This can be more efficient than running the network multiple times, and it is also the approach used by the DQN Atari-playing agent and AlphaGo's policy networks. A sketch of this option appears below the list.
Use a policy-gradient method, which works by using samples to estimate a gradient that would improve a policy estimator. See chapter 13 of Sutton and Barto's second edition of Reinforcement Learning: An Introduction for more details. Policy-gradient methods become attractive when there are large numbers of possible actions, and they can cope with continuous action spaces (by estimating the distribution function of the optimal policy - e.g. choosing the mean and standard deviation of a normal distribution, which you can sample from to take your action). You can also combine policy-gradient with a state-value approach in actor-critic methods, which can be more efficient learners than pure policy-gradient approaches.
Note that if your action space is continuous, you don't have to use a policy-gradient method; you could just quantise the action. Also, in some cases, even when actions are in theory continuous, you may find the optimal policy only uses extreme values (the classic mountain car example falls into this category: the only useful actions are maximum forward acceleration and maximum backwards acceleration).
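As a rough illustration of the second option above, here is a minimal Keras sketch with a small fully-connected network and a discrete action space; the layer sizes, problem dimensions and function names are arbitrary placeholders, not anything from the question.

import numpy as np
from tensorflow import keras

n_state_dims, n_actions = 4, 3   # hypothetical problem sizes
q_net = keras.Sequential([
    keras.Input(shape=(n_state_dims,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(n_actions),   # one output per action, i.e. q(S, ·, w) in a single forward pass
])

def choose_action(state, epsilon=0.1):
    # epsilon-greedy selection over the network's action-value outputs
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    q_values = q_net.predict(state[None, :], verbose=0)[0]   # shape (n_actions,)
    return int(np.argmax(q_values))

print(choose_action(np.zeros(n_state_dims)))

During the update, only the output corresponding to the action actually taken would receive a non-zero error; the other outputs are left untouched (zero gradient), as described in the second option.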
Do I need to modify the pseudo-code when using a network as the approximator? So that I simply minimize the MSE between the network's prediction and the reward R, for example?
No. There is no separate loss function in the pseudocode, such as the MSE you would see used in supervised learning. The error term (often called the TD error) is given by the part in square brackets, and achieves a similar effect. Literally the term ∇q(S,A,w) (sorry for the missing hat, no LaTeX on SO) means the gradient of the estimator itself - not the gradient of any loss function.
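For reference, the semi-gradient SARSA update from that pseudocode, written out (hats denote the function approximator), is:

w \leftarrow w + \alpha \left[ R + \gamma \hat{q}(S', A', w) - \hat{q}(S, A, w) \right] \nabla \hat{q}(S, A, w)

so the bracketed TD error plays the role that a loss gradient would play in supervised learning, without an explicit MSE objective.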

IBM Bluemix - Visual Recognition. Why low scores?

I’m using Visual Recognition service on IBM Bluemix.
I have created some classifiers, in particular two of these with this objective:
first: a “generic” classifier that has to return a confidence score for the recognition of a particular object in the image. I've trained it with 50 positive examples of the object and 50 negative examples of something similar to the object (details of it, its components, images like it, etc.).
second: a more specific classifier that recognizes the particular type of the object identified before, if the score of the first classification is high enough. This new classifier has been trained in the same way as the first one: 50 positive examples of type A objects, 50 negative examples of type B objects. This second categorization should be more specific than the first one, because the images are more detailed and are all similar to one another.
The result is that the two classifiers work well, and the results for a particular set of images match the expected truth in most cases, which should mean that both have been trained well.
But there is a thing that I don’t understand.
In both classifiers, if I try to classify one of the images that was used in the positive training set, my expectation is that the confidence score should be near 90-100%. Instead, I always obtain a score in the range between 0.50 and 0.55. The same thing happens when I try an image very similar to one from the positive training set (scaled, reflected, cut out, etc.): the confidence never goes above roughly 0.55.
I've tried creating a similar classifier with 100 positive images and 100 negative images, but the final result never changes.
The question is: why is the confidence score so low? Why is it not near 90-100% for images used in the positive training set?
The scores from Visual Recognition custom classifiers range from 0.0 to 1.0, but they are unitless and are not percentages or probabilities. (They do not add up to 100% or 1.0)
When the service creates a classifier from your examples, it is trying to figure out what distinguishes the features of one class of positive_examples from the other classes of positive_examples (and negative_examples, if given). The scores are based on the distance to a decision boundary between the positive examples for the class and everything else in the classifier. It attempts to calibrate the score output for each class so that 0.5 is a decent decision threshold, to say whether something belongs to the class.
However, given the cost-benefit balance of false alarms vs. missed detections in your application, you may want to use a higher or lower threshold for deciding whether an image belongs to a class.
Without knowing the specifics of your class examples, I might guess that there is a significant amount of similarity between your classes, that maybe in the feature space your examples are not in distinct clusters, and that the scores reflect this closeness to the boundary.
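Purely as an illustration of the thresholding point above (the result format here is a simplified, hypothetical list of class/score pairs, not the exact service response):

# Hypothetical (class, score) pairs as returned by a custom classifier
results = [("type_A", 0.53), ("type_B", 0.41), ("generic_object", 0.58)]
threshold = 0.6   # stricter than the calibrated 0.5, if false alarms are costly
accepted = [(name, score) for name, score in results if score >= threshold]
print(accepted)   # only classes whose score clears the chosen threshold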

Best feature selection method for wholeslide images

I have been working on extracting features from wholeslide images in MATLAB. So far I have succeeded in extracting 118 features from 92 training images. I want to select the feature that provides the best two-class separability, the two classes being 'Infected' and 'Normal'. I evaluated the ks-density plot of each feature but couldn't get a numerical measure of the amount of overlap between the pdfs of a given feature for the 'Normal' class and the 'Infected' class.
Further, I tried ranking the features based on the method given in this paper: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0082710#pone.0082710-Newcombe1.
But this method seems to pick as the best feature one that has a considerable amount of pdf overlap between the two classes. When I check the method by measuring the Bhattacharyya distance between the feature vectors of the two classes, a different set of features comes out as best.
I am not able to decide which method to use for feature selection. Can someone give me some guidance?
My goal is to correlate the amount of overlap between the pdfs of the feature vectors of the two classes with the results of a proper feature selection algorithm. In other words, what the chosen algorithm points out as the best feature for classification should have the least pdf overlap.
It would also be great if someone could give me guidance on how to estimate the amount of overlap between a given feature vector of the 'Normal' class and the same feature vector of the 'Infected' class.
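One possible way to turn "amount of overlap" into a number is sketched below: estimate each class's pdf for a single feature with a kernel density estimate and integrate the pointwise minimum (the overlap coefficient), alongside the related Bhattacharyya coefficient. This is one choice among many and assumes each feature is a 1-D sample per class; it is not tied to the paper's ranking method.

import numpy as np
from scipy.stats import gaussian_kde

def pdf_overlap(x_normal, x_infected, grid_size=512):
    # Estimate each class pdf on a common grid, then integrate the pointwise
    # minimum (overlap coefficient, 0 = no overlap, 1 = identical pdfs).
    lo = min(x_normal.min(), x_infected.min())
    hi = max(x_normal.max(), x_infected.max())
    grid = np.linspace(lo, hi, grid_size)
    p = gaussian_kde(x_normal)(grid)
    q = gaussian_kde(x_infected)(grid)
    dx = grid[1] - grid[0]
    overlap = np.minimum(p, q).sum() * dx
    bhattacharyya = np.sqrt(p * q).sum() * dx   # related coefficient, also in [0, 1]
    return overlap, bhattacharyya

# toy usage with synthetic feature samples for the two classes
print(pdf_overlap(np.random.normal(0, 1, 46), np.random.normal(2, 1, 46)))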

KNN Classifier using cross validation

I am trying to implement a KNN classifier using the cross-validation approach, where I have several images of a certain character for training (e.g. 5 images) and another two for testing. I get the idea of cross-validation: choose the K with the lowest error value during training, then use it with the test data to find out how accurate my results are.
My question is: how do I train on images in MATLAB to get my K value? Do I compare them and try to find mismatches, or what?
Any help would be really appreciated.
First of all you need to define your task precisely. For example: given an image I in R^(MxN), we wish to classify I as an image containing faces or an image without faces.
I often work with pixel classifiers, where the task is something like: For an image I decide if each pixel is a face pixel or a non-face pixel.
An important part of defining the task is to make a hypothesis that can be used as the basis for training a classifier. For example: we believe that the distribution of pixel intensities can be used to discriminate images of faces from images not containing faces.
Then you need to select some features that define your image. This can be done in many ways and you should search for what other people do when they analyse the same type of images you are working with.
One widely used method in pixel classification is to use pixel intensity values and do a multi-scale analysis of the image. The idea in multi-scale analysis is that different structures are most evident at different levels of blurring, called scales. As an illustration, consider an image of a tree. Without blurring we notice the fine structure, such as small branches and leaves. When we blur the image we notice the trunk and major branches. This is often used as part of segmentation methods.
When you know your task and the features, you can train a classifier. If you use kNN and cross-validation to find the best k, you should split your dataset into training/test sets and then split the training set into train/validation sets. You then train using the reduced training set and use the validation set to decide which k is best. In the case of binary classification, e.g. face vs non-face, the error rate is often used as a measure of performance.
Finally you use the parameters to train the classifier on the full dataset and estimate its performance on the test set.
A classification example: With or without milk?
As a full example, consider images of a cup of coffee taken from above, so each shows the rim of the cup surrounding a brown-colored disk. Further assume that all images are scaled and cropped so that the diameter of the disk is the same and the dimensions of the image are the same. To simplify the task, we convert the color image to grayscale and scale the pixel intensities to the range [0,1].
We want to train a classifier so it can distinguish coffee with milk from coffee without milk. From inspection of histograms of some of the coffee images, we see that each image has two "bumps" in the histogram that are clearly separated. We believe that these bumps correspond to foreground (coffee) and background. Now we make the hypothesis that the average intensity of the foreground can be used to distinguish between coffee+milk/coffee.
To find the foreground pixels we observe that because the foreground/background ratio is the same (by design) we can just find the intensity value that gives us that ratio for each image. Then we calculate the average intensity of the foreground pixels and use this value as a feature for each image.
If we have N images that we have manually labeled, we split this into training and test set. We then calculate the average foreground intensity for each image in the training set, giving us a set of (average foreground intensity, label) values. We want to use kNN where an image is assigned the same class as the majority class of the k closest images. We measure the distance as the absolute value of the difference in average foreground pixel intensity.
We search for the optimal k with cross validation. We use 2-fold cross validation (aka holdout) to find the best k. We test k = {1,3,5} and select the k that gives the least prediction error on the validation set.
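As a concrete (if toy) version of this recipe, here is a sketch in Python/scikit-learn rather than MATLAB; the image data, the foreground ratio and the 0/1 labels are all made up purely for illustration.

import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def foreground_mean(img, fg_ratio=0.5):
    # Choose the intensity threshold so that exactly fg_ratio of the pixels count
    # as foreground (the darker coffee disk), then average those intensities.
    thresh = np.quantile(img, fg_ratio)
    return img[img <= thresh].mean()

# Dummy stand-ins for the manually labelled grayscale images in [0, 1]
images = [np.random.rand(64, 64) for _ in range(20)]
labels = np.array([0] * 10 + [1] * 10)   # 0 = without milk, 1 = with milk

X = np.array([[foreground_mean(im)] for im in images])   # one feature per image
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, stratify=labels)

# 2-fold cross-validation on the training set to pick k from {1, 3, 5}
best_k = max([1, 3, 5], key=lambda k: cross_val_score(
    KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=2).mean())

final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, final.score(X_test, y_test))   # chosen k and test-set accuracy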

Simple Sequential feature selection in Matlab

I have a 40x3249 noisy dataset and a 40x1 result set. I want to perform simple sequential feature selection on it in MATLAB. The MATLAB example is complicated and I can't follow it. Even a few examples on SO didn't help. I want to use a decision tree as the classifier to perform feature selection. Can someone please explain in simple terms?
Also, is it a problem that my dataset has a very low number of observations compared to the number of features?
I am following this example: Sequential feature selection Matlab, and I am getting an error like this:
The pooled covariance matrix of TRAINING must be positive definite.
I've explained the error message you're getting in answers to your previous questions.
In general, it is a problem that you have many more variables than samples. This will prevent you using some techniques, such as the discriminant analysis you were attempting, but it's a problem anyway. The fact is that if you have that high a ratio of variables to samples, it is very likely that some combination of variables would perfectly classify your dataset even if they were all random numbers. That's true if you build a single decision tree model, and even more true if you are using a feature selection method to explicitly search through combinations of variables.
I would suggest you try some sort of dimensionality reduction method. If all of your variables are continuous, you could try PCA as suggested by #user1207217. Alternatively you could use a latent variable method for model-building, such as PLS (plsregress in MATLAB).
If you're still intent on using sequential feature selection with a decision tree on this dataset, then you should be able to modify the example in the question you linked to, replacing the call to classify with one to classregtree.
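If a Python analogue helps to see the shape of the procedure, a sketch with scikit-learn is below; this is not the MATLAB sequentialfs API, and the tree depth, number of features to select and CV folds are arbitrary choices.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SequentialFeatureSelector

rng = np.random.default_rng(0)
X = rng.random((40, 3249))            # 40 observations x 3249 noisy features
y = np.array([0] * 20 + [1] * 20)     # 40x1 binary response (placeholder labels)

selector = SequentialFeatureSelector(
    DecisionTreeClassifier(max_depth=3),   # decision tree scores each candidate subset
    n_features_to_select=10,               # stop after 10 features (arbitrary)
    direction="forward",
    cv=5,                                  # cross-validated scoring of each candidate subset
)
selector.fit(X, y)
print(np.flatnonzero(selector.get_support()))   # indices of the selected features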
This error comes from the use of the classify function in that question, which is performing LDA. This error occurs when the data is rank deficient (or, in other words, when some features are almost exactly correlated). In order to overcome this, you should project the data down to a lower-dimensional subspace. Principal component analysis can do this for you. See here for more details on how to use the pca function in MATLAB's Statistics Toolbox.
[basis, scores, latent] = pca(X); % principal directions (columns of basis), projected data (scores) and component variances (latent); X has observations as rows
indices = find(latent > eps(2*max(latent))); % keep components whose variance exceeds machine precision of the largest one, with a little extra tolerance (2x)
new_basis = basis(:, indices); % the relevant principal directions as column vectors
X_new = scores(:, indices); % the data expressed in the reduced basis (equivalent to projecting the centred X onto new_basis)
This should get you automatic projections down into a relevant subspace. Note that your features won't have the same meaning as before, because they will be weighted combinations of the old features.
Extra note: If you don't want to change your feature representation, then instead of classify, you need to use something which works with rank deficient data. You could roll your own version of penalised discriminant analysis (which is quite simple), use support vector machines, or other classification functions which don't break with correlated features as LDA does (by virtue of requiring matrix inversion of the covariance estimate).
EDIT: P.S I haven't tested this, because I have rolled my own version of PCA in Matlab.