Precision and recall in an item-based recommender with Boolean preferences in Mahout

I am trying to calculate precision and recall at n on a data set with Boolean preferences, using the item-item recommender provided in Mahout.
I am using GenericBooleanPrefItemBasedRecommender and
evaluate(RecommenderBuilder recommenderBuilder, DataModelBuilder dataModelBuilder, DataModel dataModel, IDRescorer rescorer, int at, double relevanceThreshold, double evaluationPercentage) throws TasteException;
Since the preferences are Boolean, the set of "relevant" or "good" movies for a user is all the ones rated 1.
If I run the same code many times, it always gives the same values of precision and recall, and they are always equal to each other. Why? I am NOT using RandomUtils.useTestSeed().
How does it split the data into training and test sets?
Possibilities:
a) It randomly divides the total data set into test and training sets at the beginning, OR for each user it randomly puts a fixed percentage of relevant movies into the test set. How does it decide this percentage, since there is no parameter for the user to supply it? And why do I get the same values of P and R each time I run the code, and why are P at n and R at n equal?
b) For each user, it puts all relevant movies in the training set. But then there is no information left about the user to make any recommendations, so this is not possible.
Since I am getting that P at n and R at n are equal, does that mean that for each user, the number of relevant movies moved to the test set each time equals the number of recommendations, i.e. n? And if the n relevant movies put in the test set are random, why do I get the same values of P and R each time I run the code?
The only explanation I can think of that accounts for these results is that the recommender calculates P and R at n as follows:
One by one, for each user it puts n relevant movies in the test set. The selection has to be random, since it cannot distinguish between relevant movies, yet the process must be deterministic, because each run of the code picks the same n relevant movies for each user. It then makes n recommendations and calculates P and R at n.
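If that is what happens, the equality of the two metrics follows directly. As a worked sanity check (this is my reading of the behavior, not Mahout's documented algorithm): with exactly n relevant items per user in the test set and n recommendations made,

$$P@n = \frac{|\text{hits}|}{n}, \qquad R@n = \frac{|\text{hits}|}{|\text{relevant items in the test set}|} = \frac{|\text{hits}|}{n}$$

so the two denominators coincide and P@n = R@n regardless of how many recommendations hit.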
While this explains the results, I don't think it is a good process, because:
1) The training and test sets are not defined as a percentage split, so this is not consistent with the usual definition.
2) P and R will always be equal to each other, so we effectively get one metric instead of two.
3) The supposedly random choice of n movies is the same on every run.
EDIT: I AM ADDING MY FULL CODE IN CASE IT HELPS ANSWER MY QUESTION:
import java.io.File;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public static void main(String[] args) throws Exception {
    FileDataModel model = new FileDataModel(new File("data/test.csv"));
    RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
    RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
        @Override
        public Recommender buildRecommender(DataModel model) throws TasteException {
            ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
            return new GenericBooleanPrefItemBasedRecommender(model, similarity);
        }
    };
    IRStatistics stats = evaluator.evaluate(
        recommenderBuilder, null, model, null, 5,
        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
    System.out.println(stats.getPrecision());
    System.out.println(stats.getRecall());
}

I don't know for sure, but if you seed a random number generator with the same value each time, the sequence of numbers it returns will be identical. Check whether there is a way to seed the RNG with something like the system time. Just a guess.
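To illustrate the general point with plain java.util.Random (a generic sketch, not Mahout's internals):

import java.util.Random;

public class SeedDemo {
    public static void main(String[] args) {
        // A fixed seed reproduces the same sequence on every run
        Random fixed = new Random(42L);
        // A time-based seed produces a different sequence on each run
        Random timeSeeded = new Random(System.currentTimeMillis());
        System.out.println(fixed.nextInt(100));      // identical across runs
        System.out.println(timeSeeded.nextInt(100)); // varies across runs
    }
}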

Check out my answer on a related question:
How mahout's recommendation evaluator works
I think this will help you understand how the evaluation works, how the relevant items are chosen, and how Precision and Recall are computed.

Related

How to get a new solution with high probability from previously found, incomplete solutions with different probabilities?

I am working on an AI algorithm. When the program first runs, a random solution is generated, and from it 10 solution vectors are created in the first iteration. By analyzing these solutions we can give each of them a probability (highest, second highest, third highest, and so on) of being close to the optimal solution. For the second iteration I want the input to be a vector (a possible solution) obtained from those 10 previously found vectors. But I need this solution vector to take all the previous solutions into account, each with a different impact depending on its probability...
i.e. A = [4.7, 5.6, 3.5, 9], B = [-7.9, 8, -2.8, 4.6], C = [7, 9.7, 4.6, 3.9], ...
I used the mean in my program:
NextPossibleSolution = mean([A;B;C])
But do you think the mean is the right move? I don't think so, because every solution then contributes equally to the next possible solution (the next input), regardless of its likelihood... If there is a method or formula for this, please let me know. I really need it badly... a billion thanks!
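A probability-weighted average is the usual alternative: each candidate solution contributes in proportion to its weight. A minimal Java sketch of the idea (the weights here are hypothetical placeholders for whatever scores your algorithm assigns):

public class WeightedMean {

    // Combine solution vectors into one vector, weighting each by its probability
    static double[] weightedMean(double[][] solutions, double[] weights) {
        double totalWeight = 0.0;
        for (double w : weights) {
            totalWeight += w;
        }
        double[] result = new double[solutions[0].length];
        for (int i = 0; i < solutions.length; i++) {
            for (int j = 0; j < result.length; j++) {
                result[j] += weights[i] * solutions[i][j];
            }
        }
        for (int j = 0; j < result.length; j++) {
            result[j] /= totalWeight; // normalize in case the weights don't sum to 1
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] solutions = {
            {4.7, 5.6, 3.5, 9},
            {-7.9, 8, -2.8, 4.6},
            {7, 9.7, 4.6, 3.9}
        };
        double[] weights = {0.5, 0.3, 0.2}; // hypothetical: most likely solution weighted highest
        System.out.println(java.util.Arrays.toString(weightedMean(solutions, weights)));
    }
}

With equal weights this reduces to the plain mean; skewing the weights toward the more probable solutions gives them the larger impact you are after.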

Johansen test on two stocks (for pairs trading) yielding weird results

I hope you can help me with this one.
I am using cointegration to discover potential pairs trading opportunities within stocks and more precisely I am utilizing the Johansen trace test for only two stocks at a time.
I have several securities, but for each test I only test two at a time.
If two stocks are found to be cointegrated using the Johansen test, the idea is to define the spread as
beta' * p(t-1) - c
where beta'=[1 beta2] and p(t-1) is the (2x1) vector of the previous stock prices. Notice that I seek a normalized first coefficient of the cointegration vector. c is a constant which is allowed within the cointegration relationship.
I am using Matlab to run the tests (jcitest), but have also tried Eviews for comparison; the two programs yield the same results.
When I run the test and find two stocks to be cointegrated, I usually get output like
beta_1 = 12.7290
beta_2 = -35.9655
c = 121.3422
Since I want a normalized first beta coefficient, I set beta1 = 1 and obtain
beta_2 = -35.9655/12.7290 = -2.8255
c =121.3422/12.7290 = 9.5327
I can then generate the spread as beta' * p(t-1) - c. When the spread gets sufficiently low, I buy 1 share of stock 1 and short beta_2 shares of stock 2 and vice versa when the spread gets high.
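Numerically, the spread is just a dot product with the normalized cointegration vector, minus the constant. A small sketch in Java using the normalized values above (the prices are made-up placeholders):

public class SpreadDemo {
    public static void main(String[] args) {
        // Normalized cointegration vector beta' = [1, beta_2] and constant c
        double beta2 = -2.8255;
        double c = 9.5327;
        // Hypothetical previous prices p(t-1) for stock 1 and stock 2
        double p1 = 35.0;
        double p2 = 12.0;
        // spread = beta' * p(t-1) - c = 1*p1 + beta2*p2 - c
        double spread = p1 + beta2 * p2 - c;
        System.out.println("Spread: " + spread);
    }
}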
THE PROBLEM:
Since I am testing an awful lot of stock pairs, I obtain a lot of output. Quite often, however, I receive output where the estimated beta_1 and beta_2 are of the same sign, e.g.
beta_1= -1.4
beta_2= -3.9
When I normalize these according to beta_1, I get:
beta_1 = 1
beta_2 = -3.9 / -1.4 = 2.7857
The current pairs trading literature doesn't mention any cases where the betas are of the same sign - how should it be interpreted? Since this is pairs trading, I am supposed to long one stock and short the other when the spread deviates from its long run mean. However, when the betas are of the same sign, to me it seems that I should always go long/short in both at the same time? Is this the correct interpretation? Or should I modify the way in which I normalize the coefficients?
I could really use some help...
EXTRA QUESTION:
Under some of my tests, I reject both the hypothesis of r=0 cointegration relationships and r<=1 cointegration relationships. I find this very mysterious, as I am only considering two variables at a time, and there can, at maximum, only be r=1 cointegration relationship. Can anyone tell me what this means?

Partitioning a number into a number of almost equal partitions

I would like to partition a number into several almost equal values. The only criterion is that each value must be between 60 and 80.
For example, for the value 300, one valid partition is 75 * 4 = 300.
I would like to know a method to get this 4 and 75 in the above example. The values do not all need to be equal, but each should be between 60 and 80. Any operations can be used (addition, subtraction, etc.); however, the outputs must not be floating point.
Also, the total does not have to be exactly 300 as in this case: the values may sum to up to +40 over the total, so for 300 they can sum to as much as 340 if required.
Assuming only addition, you can formulate this problem as a linear programming problem. You would choose an objective function that maximizes the sum of all of the factors chosen to generate your number. Your objective function would therefore be:

$$\max \sum_{i=1}^{n} x_i$$
In this case, n is the number of factors you are using to decompose your number. Each x_i is a particular factor in the overall sum of the value you want to decompose. I'm also going to assume that none of the factors can be floating point; they can only be integer. As such, you need to use a special case of linear programming called integer programming, where the constraints and the actual solution to your problem are all integers. In general, the integer programming problem is formulated thusly:

$$\min_{x} f^{T}x \quad \text{subject to} \quad Ax \le b, \quad A_{eq}x = b_{eq}, \quad lb \le x \le ub, \quad x \in \mathbb{Z}^{n}$$
You are actually trying to minimize this objective function, such that you produce a parameter vector of x that are subject to all of these constraints. In our case, x would be a vector of numbers where each element forms part of the sum to the value you are trying to decompose (300 in your case).
You have inequalities, equalities and also boundaries of x that each parameter in your solution must respect. You also need to make sure that each parameter of x is an integer. As such, MATLAB has a function called intlinprog that will perform this for you. However, this function assumes that you are minimizing the objective function, and so if you want to maximize, simply minimize on the negative. f is a vector of weights to be applied to each value in your parameter vector, and with our objective function, you just need to set all of these to -1.
Therefore, to formulate your problem in an integer programming framework, you are actually doing:

$$\min_{x} \; -\sum_{i=1}^{n} x_i \quad \text{subject to} \quad \sum_{i=1}^{n} x_i = V, \quad 60 \le x_i \le 80, \quad x_i \in \mathbb{Z}$$
V would be the value you are trying to decompose (so 300 in your example).
The standard way to call intlinprog is in the following way:
x = intlinprog(f,intcon,A,b,Aeq,beq,lb,ub);
f is the vector that weights each parameter of the solution, and intcon denotes which of your parameters need to be integer. In this case, you want all of them to be integer, so you would supply the vector 1 to n, where n is the number of factors you want to decompose V into (same as before). A and b are the matrix and vector that define your inequality constraints; because you have no inequalities (only an equality), you'd set these to empty ([]). Aeq and beq are the same as A and b, but for equality constraints. Because you have only one constraint here, Aeq is a single row with every value set to 1, and beq is a single value: the number you are trying to factorize. lb and ub are the lower and upper bounds for each value in the parameter set, so these would be vectors of 60s and 80s respectively, ensuring that each parameter is bounded between the two.
Now, because you don't know how many factors will evenly decompose your value, you'll have to loop over a given set of factor counts (say 1 to 10, or 1 to 20), place your results in a cell array, and then examine whether an integer decomposition was successful.
num_factors = 20; %// Number of factors to try and decompose your value
V = 300;
results = cell(1, num_factors);
%// Try to solve the problem for a number of different factors
for n = 1 : num_factors
    x = intlinprog(-ones(n,1),1:n,[],[],ones(1,n),V,60*ones(n,1),80*ones(n,1));
    results{n} = x;
end
You can then go through results and see which value of n was successful in decomposing your number into that said number of factors.
One small problem here is that we also don't know how many factors we should check up to. That unfortunately I don't have an answer to, and so you'll have to play with this value until you get good results. This is also an unconstrained parameter, and I'll talk about this more later in this post.
However, intlinprog was only released in recent versions of MATLAB. If you want to do the same thing without it, you can use linprog, which is the floating point version of integer programming... actually, it's just the core linear programming framework itself. You would call linprog this way:
x = linprog(f,A,b,Aeq,beq,lb,ub);
All of the variables are the same, except that intcon is not used here... which makes sense, as linprog may generate floating-point numbers as part of its solution. Because of that, you can check whether the solution for a given n is integral: take the floor of each element, subtract it from the element itself, and sum the absolute differences over the vector. If you get 0, the result is completely integer. You'd therefore do something like:
num_factors = 20; %// Number of factors to try and decompose your value
V = 300;
results = cell(1, num_factors);
%// Try to solve the problem for a number of different factors
for n = 1 : num_factors
    x = linprog(-ones(n,1),[],[],ones(1,n),V,60*ones(n,1),80*ones(n,1));
    results{n} = x;
end
%// Loop through and determine which decompositions were successful integer ones
out = cellfun(@(x) sum(abs(floor(x) - x)), results);
%// Determine which values of n were successful in the integer composition.
final_factors = find(~out);
final_factors will contain which number of factors you specified that was successful in an integer decomposition. Now, if final_factors is empty, this means that it wasn't successful in finding anything that would be able to decompose the value into integer factors. Noting your problem description, you said you can allow for tolerances, so perhaps scan through results and determine which overall sum best matches the value, then choose whatever number of factors that gave you that result as the final answer.
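Language aside, that selection step is straightforward. A sketch in Java of the idea (the candidate sums are hypothetical stand-ins for the sums of the solutions in results):

public class BestDecomposition {
    public static void main(String[] args) {
        double V = 300.0;        // value to decompose
        double tolerance = 40.0; // allowed overshoot, per the question
        // Hypothetical sums of the candidate decompositions, one per factor count n
        double[] candidateSums = {280.0, 300.0, 320.0, 350.0};
        int best = -1;
        double bestDiff = Double.MAX_VALUE;
        for (int n = 0; n < candidateSums.length; n++) {
            double diff = candidateSums[n] - V;
            // Accept sums in [V, V + tolerance], preferring the one closest to V
            if (diff >= 0 && diff <= tolerance && diff < bestDiff) {
                bestDiff = diff;
                best = n;
            }
        }
        System.out.println("Best candidate index: " + best);
    }
}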
Now, noting from my comments, you'll see that this problem is very unconstrained: you don't know how many factors are required for an integer decomposition of your value, which is why we had to semi-brute-force it. In fact, this is a more general case of the subset sum problem, which is NP-complete. Basically, this means that no polynomial-time algorithm is known for such problems, and the only known way to get a valid solution is to try each possible candidate and check whether it satisfies the problem. Brute-forcing generally requires exponential time, which is intractable for large problems.
Another interesting fact is that modern cryptography banks on this kind of intractability: the only way to determine the right key used to encrypt your plain text is to check all possible keys, which is an intractable task... especially with 128-bit encryption, where you would have to check 2^128 possibilities. Assuming a moderately fast computer, the worst-case time to find the right key would exceed the current age of the universe. Check out the Wikipedia post on intractability with regard to key breaking in cryptography for more details.
In fact, NP-complete problems are very popular and there have been many attempts to determine whether there is or there isn't a polynomial-time algorithm to solve such problems. An interesting property is that if you can find a polynomial-time algorithm that will solve one problem, you will have found an algorithm to solve them all.
The Clay Mathematics Institute has what are known as Millennium Problems where if you solve any problem listed on their website, you get a million dollars.
Also, that's for each problem, so one problem solved == 1 million dollars!
The P versus NP question is among the seven problems up for solving. If I recall correctly, only one of the problems has been solved so far, and they were first released to the public in the year 2000 (hence "millennium"...). So it has been about 14 years, and only one problem has been solved. Don't let that discourage you though! If you want to invest some time and try to solve one of the problems, please do!
Hopefully this will be enough to get you started. Good luck!

Scala: large calculation losing value to zero/infinity

I'm trying to calculate a perplexity value for a language model, and the calculation uses a lot of large powers. I have tried converting my calculation to log space using BigDecimal, but I'm not having any luck.
var sum = 0.0
for (ngram <- testNGrams) {
  val prob = Math.log(lm.prob(ngram.last, ngram.slice(0, ngram.size - 1)))
  if (prob != 0.0) sum += prob
}
Math.pow(Math.log(Math.exp(sum)), -1.0 / wordSize.toDouble)
How can I perform such a calculation in Scala without losing my large/small values to zero/Infinity? It seems like a trivial question, but I haven't managed to do it.
In the above, you can assume that the method lm.prob returns correct probabilities between 0 and 1; this has been amply tested.
Write everything in terms of log probabilities, not probabilities.
For instance, things like log(exp(sum)) just warm up your CPU while throwing away useful information. Avoid!
If you must convert to actual probabilities, do so at the very last step you can.
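As an illustration, perplexity can be accumulated entirely in log space, converting back only at the very end. A hedged sketch in Java (logProb and the test n-grams are placeholders for the asker's lm.prob and data):

public class Perplexity {

    // Hypothetical stand-in for a language model's log probability
    static double logProb(String ngram) {
        return Math.log(0.01); // placeholder value
    }

    public static void main(String[] args) {
        String[] testNGrams = {"a b", "b c", "c d"};
        double logSum = 0.0;
        for (String ngram : testNGrams) {
            logSum += logProb(ngram); // sum logs; never exponentiate mid-computation
        }
        int wordCount = testNGrams.length; // stand-in for the corpus word count
        // Perplexity = exp(-(1/N) * sum of log probabilities); exp only at the last step
        double perplexity = Math.exp(-logSum / wordCount);
        System.out.println("Perplexity: " + perplexity);
    }
}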

Jahmm lib: how to interpret negative value from ForwardBackwardScaledCalculator.lnProbability()?

I use the Jahmm library for classification of accelerometer sequences.
I have created my models, but when I try to calculate the probability of a test sequence on a model with:
ForwardBackwardScaledCalculator fbsc = new ForwardBackwardScaledCalculator(test_pair.getValue(),model_pair.getValue().get_hmm());
System.out.println(fbsc.lnProbability());
I get negative values like -1278.0926336276573.
The comment in the code of the library states that the lnProbability method:
Return the napierian logarithm of the probability of the sequence that generated this object.
Returns: The probability of the sequence of interest's napierian logarithm
But how do I compare two such logarithms? I call the method on two different models with two test sequences, so I get four values:
The test sequence fast_test.seq on fast_model yields a Napierian log of -1278.0926336276573.
The test sequence fast_test.seq on slow_model yields a Napierian log of -1862.6947488370433.
The test sequence slow_test.seq on fast_model yields a Napierian log of -4433.949818774553.
The test sequence slow_test.seq on slow_model yields a Napierian log of -4208.071445499895.
But in this context, does it mean that the closer we get to zero, the more similar the test sequence is to the model (so in this example the classification accuracy is 100%)?
Thank you
If by "Napierian logarithm", the natural logarithm is meant, then you can get a probability from a return value x by raising e to the x, e.g. using Math.exp. However, the reason that logarithms are returned is because the probability values are too small to represent in a double; Math.exp(-1278.0926336276573) will simply return zero. See the Wikipedia article about log probabilities.
does it mean that the closer we get to zero, the more similar the test sequence is to the model
exp(0) == 1 and log(1) == 0, and indeed the lower the probability, the smaller (more negative) its logarithm. So the closer you get to zero, the more probable the sequence is under the model.
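In practice, classification is done by comparing the log values directly and picking the larger (less negative) one; there is no need to exponentiate. A minimal sketch in Java using the four numbers from the question:

public class CompareLogLikelihoods {
    public static void main(String[] args) {
        // Log-likelihoods reported in the question
        double fastOnFast = -1278.0926336276573;
        double fastOnSlow = -1862.6947488370433;
        double slowOnFast = -4433.949818774553;
        double slowOnSlow = -4208.071445499895;
        // Pick the model with the larger (less negative) log-likelihood
        String fastLabel = fastOnFast > fastOnSlow ? "fast_model" : "slow_model";
        String slowLabel = slowOnFast > slowOnSlow ? "fast_model" : "slow_model";
        System.out.println("fast_test.seq classified as: " + fastLabel); // fast_model
        System.out.println("slow_test.seq classified as: " + slowLabel); // slow_model
    }
}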
However, this need not directly relate to "similarity to a model", let alone "classification accuracy", since HMMs (being generative models) will ascribe lower probability to longer sequences. Read up on HMMs in your favorite textbook; a full explanation would be too long for this answer box and is a math question, so off-topic for this website.