Jahmm lib: how to interpret negative value from ForwardBackwardScaledCalculator.lnProbability()? - classification

I use the Jahmm library for classification of accelerometer sequences.
I have created my models but when i try to calculate the proibablity of a test sequence on a model by:
ForwardBackwardScaledCalculator fbsc = new ForwardBackwardScaledCalculator(test_pair.getValue(),model_pair.getValue().get_hmm());
System.out.println(fbsc.lnProbability());
I get negative values like -1278.0926336276573.
The comment in the code of the library states that the lnProbability method:
Return the napierian logarithm of the probability of the sequence that
generated this object.
Returns: The probability of the sequence of interest's napierian
logarithm
But how to compare two of such logarithms? I call the method on two different models with the two test sequences so i get 4 probabilities:
The test sequence: fast_test.seq on fast_model yields a Napierian log from -1278.0926336276573
The test sequence: fast_test.seq on slow_model yields a Napierian log from -1862.6947488370433
The test sequence: slow_test.seq on fast_model yields a Napierian log from -4433.949818774553
The test sequence: slow_test.seq on slow_model yields a Napierian log from -4208.071445499895
But in this context, does it mean that the closer we get to zero, the more similar the test sequence is to the model (so in this example the classification accuracy = 100%?)
Thank you

If by "Napierian logarithm", the natural logarithm is meant, then you can get a probability from a return value x by raising e to the x, e.g. using Math.exp. However, the reason that logarithms are returned is because the probability values are too small to represent in a double; Math.exp(-1278.0926336276573) will simply return zero. See the Wikipedia article about log probabilities.
does it mean that the closer we get to zero, the more similar the test sequence is to the model
exp(0) == 1 and log(1) == 0, and indeed the lower the probability, the smaller (more negative) its logarithm. So, the closer you get to zero, the more probable the sentence is under the model.
However, this need not directly relate to "similarity to a model", let alone "classification accuracy", since HMMs (being generative models) will ascribe lower probability to longer sequences. Read up on HMMs in your favorite textbook; a full explanation would be too long for this answer box and is a math question, so off-topic for this website.

Related

Neutrality for sentiment analysis in spark

I have built a pretty basic naive bayes over apache spark and using mllib of course. But I have a few clarifications on what exactly neutrality means.
From what I understand, in a given dataset there are pre-labeled sentences which comprise of the necessary classes, let's take 3 for example below.
0-> Negative sentiment
1-> Positive sentiment
2-> Neutral sentiment
This neutral is pre-labeled in the training set itself.
Is there any other form of neutrality handling. Suppose if there are no neutral sentences available in the dataset then is it possible that I can calculate it from the scale of probability like
0.0 - 0.4 => Negative
0.4- - 0.6 => Neutral
0.6 - 1.0 => Positive
Is such kind of mapping possible in spark. I searched around but could not find any. The NaiveBayesModel class in the RDD API has a predict method which just returns a double that is mapped according to the training set i.e if only 0,1 is there it will return only 0,1 and not in a scaled manner such as 0.0 - 1.0 as above.
Any pointers/advice on this would be incredibly helpful.
Edit - 1
Sample code
//Performs tokenization,pos tagging and then lemmatization
//Returns a array of string
val tokenizedString = Util.tokenizeData(text)
val hashingTF = new HashingTF()
//Returns a double
//According to the training set 1.0 => Positive, 0.0 => Negative
val status = model.predict(hashingTF.transform(tokenizedString.toSeq))
if(status == 1.0) "Positive" else "Negative"
Sample dataset content
1,Awesome movie
0,This movie sucks
Of course the original dataset contains more longer sentences, but this should be enough for explanations I guess
Using the above code I am calculating. My question is the same
1) Neutrality handling in dataset
In the above dataset if I am adding another category such as
2,This movie can be enjoyed by kids
For arguments sake, lets assume that it is a neutral review, then the model.predict method will give either 1.0,0.0,2.0 based on the passed in sentence.
2) Using the model.predictProbabilities it gives an array of doubles, but I am not sure in what order it gives the result i.e index 0 is for negative or for positive? With three features i.e Negative,Positive,Neutral then in what order will that method return the predictions?
It would have been helpful to have the code that builds the model (for your example to work, the 0.0 from the dataset must be converted to 0.0 as a Double in the model, either after indexing it with a StringIndexer stage, or if you converted that from the file), but assuming that this code works:
val status = model.predict(hashingTF.transform(tokenizedString.toSeq))
if(status == 1.0) "Positive" else "Negative"
Then yes, it means the probabilities at index 0 is that of negative and at 1 that of positive (it's a bit strange and there must be a reason, but everything is a double in ML, even feature and category indexes). If you have something like this in your code:
val labelIndexer = new StringIndexer()
.setInputCol("sentiment")
.setOutputCol("indexedsentiment")
.fit(trainingData)
Then you can use labelIndexer.labels to identify the labels (probability at index 0 is for labelIndexer.labels at index 0.
Now regarding your other questions.
Neutrality can mean two different things. Type 1: a review contains as much positive and negative words Type 2: there is (almost) no sentiment expressed.
A Neutral category can be very helpful if you want to manage Type 2. If that is the case, you need neutral examples in your dataset. Naive Bayes is not a good classifier to apply thresholding on the probabilities in order to determine Type 2 neutrality.
Option 1: Build a dataset (if you think you will have to deal with a lot of Type 2 neutral texts). The good news is, building a neutral dataset is not too difficult. For instance you can pick random texts that are not movie reviews and assume they are neutral. It would be even better if you could pick content that is closely related to movies (but neutral), like a dataset of movie synopsis. You could then create a multi-class Naive Bayes classifier (between neutral, positive and negative) or a hierarchical classifier (first step is a binary classifier that determines whether a text is a movie review or not, second step to determine the overall sentiment).
Option 2 (can be used to deal with both Type 1 and 2). As I said, Naive Bayes is not very great to deal with thresholds on the probabilities, but you can try that. Without a dataset though, it will be difficult to determine the thresholds to use. Another approach is to identify the number of words or stems that have a significant polarity. One quick and dirty way to achieve that is to query your classifier with each individual word and count the number of times it returns "positive" with a probability significantly higher than the negative class (discard if the probabilities are too close to each other, for instance within 25% - a bit of experimentations will be needed here). At the end, you may end up with say 20 positive words vs 15 negative ones and determine it is neutral because it is balanced or if you have 0 positive and 1 negative, return neutral because the count of polarized words is too low.
Good luck and hope this helped.
I am not sure if I understand the problem but:
prior in Naive Bayes is computed from the data and cannot be set manually.
in MLLib you can use predictProbabilities to obtain class probabilities.
in ML you can use setThresholds to set prediction threshold for each class.

Are the conditional probabilities of MATLAB's mnrval correct?

An MWE (stats toolbox required, tested on MATLAB R2014b):
x = (1:3)';
b = mnrfit(x,x,'model','hierarchical');
pihat = mnrval(b,x,'model','hierarchical','type','conditional')
Output:
pihat =
1 1
2.2204e-16 1
2.2204e-16 2.2204e-16
(Ignore the issued warning, it's because of the trivial example, which is linearly separable (I'm predicting x using itself). It doesn't matter: I've tried this as well with a non-trivial (and not-so minimal) example without the warning and the results are similar.)
My problem is the result. I've specified I want the conditional probabilities. According to MATLAB's documentation on mnrval:
Specify ['conditional'] to return predictions [...] in terms of the first k – 1 conditional category probabilities [...], i.e., the probability [...] for category j, given an outcome in category j or higher.
In my example this means rows of pihat contain the probability of
x=1 given x>=1
x=2 given x>=2
(A third column for x=3 is not necessary, because if the first two probabilities are known, the third is too. It follows logically from P(x=1) + P(x=2) + P(x=3) = 1.)
Am I interpreting this correctly? Thus, if x=1 is predicted, then the first column value should be large (close to one), because P(x=1) given x>=1 is large. The second column should be close to zero, because P(x=2) given x>=2 can't be large if x=1.
However, as you can see in the first row, the second column value is large as well as the first! I believe this is incorrect according to what the documentation specifies, am I right? The current (incorrect?) result implies the predicted probabilities in the rows are not of x=j given x>=j, but what are they then? Or how should I be interpreting them?
They are not equal to the cumulative probabilities, i.e. the probability of x<=j, which increases with j. I've checked this by calculating pihat2 = mnrval(b,x,'model','hierarchical','type','cumulative'); pihat2-pihat.

How do I interpret pycaffe classify.py output?

I created a GoogleNet Model via Nvidia DIGITS with two classes (called positive and negative).
If I classify an image with DIGITS, it shows me a nice result like positive: 85.56% and negative: 14.44%.
If it pass that model it into pycaffe's classify.py with the same image, I get a result like array([[ 0.38978559, -0.06033826]], dtype=float32)
So, how do I read/interpret this result? How do I calculate the confidence levels (not sure if this is the right term) shown by DIGITS from the results shown by classify.py?
This issue led me to the solution.
As the log shows, the network produces three outputs. Classifier#classify only returns the first output. So e.g. by changing predictions = out[self.outputs[0]] to predictions = out[self.outputs[2]], I get the desired values.

Assessing performance of a zero inflated negative binomial model

I am modelling the diffusion of movies through a contact network (based on telephone data) using a zero inflated negative binomial model (package: pscl)
m1 <- zeroinfl(LENGTH_OF_DIFF ~ ., data = trainData, type = "negbin")
(variables described below.)
The next step is to evaluate the performance of the model.
My attempt has been to do multiple out-of-sample predictions and calculate the MSE.
Using
predict(m1, newdata = testData)
I received a prediction for the mean length of a diffusion chain for each datapoint, and using
predict(m1, newdata = testData, type = "prob")
I received a matrix containing the probability of each datapoint being a certain length.
Problem with the evaluation: Since I have a 0 (and 1) inflated dataset, the model would be correct most of the time if it predicted 0 for all the values. The predictions I receive are good for chains of length zero (according to the MSE), but the deviation between the predicted and the true value for chains of length 1 or larger is substantial.
My question is:
How can we assess how well our model predicts chains of non-zero length?
Is this approach the correct way to make predictions from a zero inflated negative binomial model?
If yes: how do I interpret these results?
If no: what alternative can I use?
My variables are:
Dependent variable:
length of the diffusion chain (count [0,36])
Independent variables:
movie characteristics (both dummies and continuous variables).
Thanks!
It is straightforward to evaluate RMSPE (root mean square predictive error), but is probably best to transform your counts beforehand, to ensure that the really big counts do not dominate this sum.
You may find false negative and false positive error rates (FNR and FPR) to be useful here. FNR is the chance that a chain of actual non-zero length is predicted to have zero length (i.e. absence, also known as negative). FPR is the chance that a chain of actual zero length is falsely predicted to have non-zero (i.e. positive) length. I suggest doing a Google on these terms to find a paper in your favourite quantitative journals or a chapter in a book that helps explain these simply. For ecologists I tend to go back to Fielding & Bell (1997, Environmental Conservation).
First, let's define a repeatable example, that anyone can use (not sure where your trainData comes from). This is from help on zeroinfl function in the pscl library:
# an example from help on zeroinfl function in pscl library
library(pscl)
fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")
There are several packages in R that calculate these. But here's the by hand approach. First calculate observed and predicted values.
# store observed values, and determine how many are nonzero
obs <- bioChemists$art
obs.nonzero <- obs > 0
table(obs)
table(obs.nonzero)
# calculate predicted counts, and check their distribution
preds.count <- predict(fm_zinb2, type="response")
plot(density(preds.count))
# also the predicted probability that each item is nonzero
preds <- 1-predict(fm_zinb2, type = "prob")[,1]
preds.nonzero <- preds > 0.5
plot(density(preds))
table(preds.nonzero)
Then get the confusion matrix (basis of FNR, FPR)
# the confusion matrix is obtained by tabulating the dichotomized observations and predictions
confusion.matrix <- table(preds.nonzero, obs.nonzero)
FNR <- confusion.matrix[2,1] / sum(confusion.matrix[,1])
FNR
In terms of calibration we can do it visually or via calibration
# let's look at how well the counts are being predicted
library(ggplot2)
output <- as.data.frame(list(preds.count=preds.count, obs=obs))
ggplot(aes(x=obs, y=preds.count), data=output) + geom_point(alpha=0.3) + geom_smooth(col="aqua")
Transforming the counts to "see" what is going on:
output$log.obs <- log(output$obs)
output$log.preds.count <- log(output$preds.count)
ggplot(aes(x=log.obs, y=log.preds.count), data=output[!is.na(output$log.obs) & !is.na(output$log.preds.count),]) + geom_jitter(alpha=0.3, width=.15, size=2) + geom_smooth(col="blue") + labs(x="Observed count (non-zero, natural logarithm)", y="Predicted count (non-zero, natural logarithm)")
In your case you could also evaluate the correlations, between the predicted counts and the actual counts, either including or excluding the zeros.
So you could fit a regression as a kind of calibration to evaluate this!
However, since the predictions are not necessarily counts, we can't use a poisson
regression, so instead we can use a lognormal, by regressing the log
prediction against the log observed, assuming a Normal response.
calibrate <- lm(log(preds.count) ~ log(obs), data=output[output$obs!=0 & output$preds.count!=0,])
summary(calibrate)
sigma <- summary(calibrate)$sigma
sigma
There are more fancy ways of assessing calibration I suppose, as in any modelling exercise ... but this is a start.
For a more advanced assessment of zero-inflated models, check out the ways in which the log likelihood can be used, in the references provided for the zeroinfl function. This requires a bit of finesse.

Partitioning a number into a number of almost equal partitions

I would like to partition a number into an almost equal number of values in each partition. The only criteria is that each partition must be in between 60 to 80.
For example, if I have a value = 300, this means that 75 * 4 = 300.
I would like to know a method to get this 4 and 75 in the above example. In some cases, all partitions don't need to be of equal value, but they should be in between 60 and 80. Any constraints can be used (addition, subtraction, etc..). However, the outputs must not be floating point.
Also it's not that the total must be exactly 300 as in this case, but they can be up to a maximum of +40 of the total, and so for the case of 300, the numbers can sum up to 340 if required.
Assuming only addition, you can formulate this problem into a linear programming problem. You would choose an objective function that would maximize the sum of all of the factors chosen to generate that number for you. Therefore, your objective function would be:
(source: codecogs.com)
.
In this case, n would be the number of factors you are using to try and decompose your number into. Each x_i is a particular factor in the overall sum of the value you want to decompose. I'm also going to assume that none of the factors can be floating point, and can only be integer. As such, you need to use a special case of linear programming called integer programming where the constraints and the actual solution to your problem are all in integers. In general, the integer programming problem is formulated thusly:
You are actually trying to minimize this objective function, such that you produce a parameter vector of x that are subject to all of these constraints. In our case, x would be a vector of numbers where each element forms part of the sum to the value you are trying to decompose (300 in your case).
You have inequalities, equalities and also boundaries of x that each parameter in your solution must respect. You also need to make sure that each parameter of x is an integer. As such, MATLAB has a function called intlinprog that will perform this for you. However, this function assumes that you are minimizing the objective function, and so if you want to maximize, simply minimize on the negative. f is a vector of weights to be applied to each value in your parameter vector, and with our objective function, you just need to set all of these to -1.
Therefore, to formulate your problem in an integer programming framework, you are actually doing:
(source: codecogs.com)
V would be the value you are trying to decompose (so 300 in your example).
The standard way to call intlinprog is in the following way:
x = intlinprog(f,intcon,A,b,Aeq,beq,lb,ub);
f is the vector that weights each parameter of the solution you want to solve, intcon denotes which of your parameters need to be integer. In this case, you want all of them to be integer so you would have to supply an increasing vector from 1 to n, where n is the number of factors you want to decompose the number V into (same as before). A and b are matrices and vectors that define your inequality constraints. Because you want equality, you'd set this to empty ([]). Aeq and beq are the same as A and b, but for equality. Because you only have one constraint here, you would simply create a matrix of 1 row, where each value is set to 1. beq would be a single value which denotes the number you are trying to factorize. lb and ub are the lower and upper bounds for each value in the parameter set that you are bounding with, so this would be 60 and 80 respectively, and you'd have to specify a vector to ensure that each value of the parameters are bounded between these two ranges.
Now, because you don't know how many factors will evenly decompose your value, you'll have to loop over a given set of factors (like between 1 to 10, or 1 to 20, etc.), place your results in a cell array, then you have to manually examine yourself whether or not an integer decomposition was successful.
num_factors = 20; %// Number of factors to try and decompose your value
V = 300;
results = cell(1, num_factors);
%// Try to solve the problem for a number of different factors
for n = 1 : num_factors
x = intlinprog(-ones(n,1),1:n,[],[],ones(1,n),V,60*ones(n,1),80*ones(n,1));
results{n} = x;
end
You can then go through results and see which value of n was successful in decomposing your number into that said number of factors.
One small problem here is that we also don't know how many factors we should check up to. That unfortunately I don't have an answer to, and so you'll have to play with this value until you get good results. This is also an unconstrained parameter, and I'll talk about this more later in this post.
However, intlinprog was only released in recent versions of MATLAB. If you want to do the same thing without it, you can use linprog, which is the floating point version of integer programming... actually, it's just the core linear programming framework itself. You would call linprog this way:
x = linprog(f,A,b,Aeq,beq,lb,ub);
All of the variables are the same, except that intcon is not used here... which makes sense as linprog may generate floating point numbers as part of its solution. Due to the fact that linprog can generate floating point solutions, what you can do is if you want to ensure that for a given value of n, you could loop over your results, take the floor of the result and subtract with the final result, and sum over the result. If you get a value of 0, this means that you had a completely integer result. Therefore, you'd have to do something like:
num_factors = 20; %// Number of factors to try and decompose your value
V = 300;
results = cell(1, num_factors);
%// Try to solve the problem for a number of different factors
for n = 1 : num_factors
x = linprog(-ones(n,1),[],[],ones(1,n),V,60*ones(n,1),80*ones(n,1));
results{n} = x;
end
%// Loop through and determine which decompositions were successful integer ones
out = cellfun(#(x) sum(abs(floor(x) - x)), results);
%// Determine which values of n were successful in the integer composition.
final_factors = find(~out);
final_factors will contain which number of factors you specified that was successful in an integer decomposition. Now, if final_factors is empty, this means that it wasn't successful in finding anything that would be able to decompose the value into integer factors. Noting your problem description, you said you can allow for tolerances, so perhaps scan through results and determine which overall sum best matches the value, then choose whatever number of factors that gave you that result as the final answer.
Now, noting from my comments, you'll see that this problem is very unconstrained. You don't know how many factors are required to get an integer decomposition of your value, which is why we had to semi-brute-force it. In fact, this is a more general case of the subset sum problem. This problem is NP-complete. Basically, what this means is that it is not known whether there is a polynomial-time algorithm that can be used to solve this kind of problem and that the only way to get a valid solution is to brute-force each possible solution and check if it works with the specified problem. Usually, brute-forcing solutions requires exponential time, which is very intractable for large problems. Another interesting fact is that modern cryptography algorithms use NP-Complete intractability as part of their ciphertext and encrypting. Basically, they're banking on the fact that the only way for you to determine the right key that was used to encrypt your plain text is to check all possible keys, which is an intractable problem... especially if you use 128-bit encryption! This means you would have to check 2^128 possibilities, and assuming a moderately fast computer, the worst-case time to find the right key will take more than the current age of the universe. Check out this cool Wikipedia post for more details in intractability with regards to key breaking in cryptography.
In fact, NP-complete problems are very popular and there have been many attempts to determine whether there is or there isn't a polynomial-time algorithm to solve such problems. An interesting property is that if you can find a polynomial-time algorithm that will solve one problem, you will have found an algorithm to solve them all.
The Clay Mathematics Institute has what are known as Millennium Problems where if you solve any problem listed on their website, you get a million dollars.
Also, that's for each problem, so one problem solved == 1 million dollars!
(source: quickmeme.com)
The NP problem is amongst one of the seven problems up for solving. If I recall correctly, only one problem has been solved so far, and these problems were first released to the public in the year 2000 (hence millennium...). So... it has been about 14 years and only one problem has been solved. Don't let that discourage you though! If you want to invest some time and try to solve one of the problems, please do!
Hopefully this will be enough to get you started. Good luck!