Pearson correlation fails for perfectly correlated sets - recommendation-engine

Consider the following examples of the Pearson correlation coefficient on sets of film ratings by users A and B:
A = [2,4,4,4,4]
B = [5,4,4,4,4]
pearson(A,B) = -1
A = [5,5,5,5,5]
B = [5,5,5,5,5]
pearson(A,B) = NaN
Pearson correlation seems to be widely used for calculating the similarity between two sets of ratings in collaborative filtering. However, the sets above show high (even perfect) similarity, yet the outputs suggest the sets are negatively correlated (or an error is encountered due to division by zero).
I initially thought it was an issue in my implementation, but I've since validated it against a few online calculators.
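For instance, scipy.stats.pearsonr (used here purely as an independent check, not the implementation in question) gives the same outputs:

from scipy.stats import pearsonr

A1, B1 = [2, 4, 4, 4, 4], [5, 4, 4, 4, 4]
A2, B2 = [5, 5, 5, 5, 5], [5, 5, 5, 5, 5]

r1, _ = pearsonr(A1, B1)
print(r1)  # -1.0: the only deviations from the two means move in opposite directions

r2, _ = pearsonr(A2, B2)  # both inputs are constant, so the denominator is zero
print(r2)  # nan (SciPy also warns about the constant input)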
If the outputs are correct, why is Pearson correlation considered a good choice for this application?

Pearson correlation measures the association between two data sets, i.e. how they increase or decrease together.
In visual terms: how close the points lie to a straight line when one set is plotted on the x-axis and the other on the y-axis.
[Figure omitted: an example of positive correlation, irrespective of the difference in scale of the two data sets.]
In your second example, the data sets are constant (every value is identical), so their standard deviations are zero. The product of the two standard deviations forms the denominator of the Pearson correlation, so the coefficient is undefined.
In other words, it is not possible to say how one set increases or decreases along with the other.
If the sets were plotted against each other, all the data points would sit on a single point, so no correlation pattern can be discerned.
A very simple solution would be to handle these cases separately; or, if you want to keep the same flow, a neat hack is to make sure that the standard deviation of neither set is zero.
A non-zero standard deviation can be achieved by altering a single value of the set by a tiny amount; since the data sets are highly correlated, this will still give you a high correlation coefficient.
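To illustrate the "handle these cases separately" route, here is a minimal sketch in plain NumPy (not any particular recommender library; the fallback value for constant vectors is a convention you have to choose yourself):

import numpy as np

def pearson_similarity(a, b, constant_default=1.0):
    """Pearson correlation with an explicit branch for zero-variance inputs.

    `constant_default` is an assumption: what similarity to report when one
    or both rating vectors are constant (here, two identical flat vectors are
    treated as perfectly similar). Pick a convention that fits your system.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    if a.std() == 0.0 or b.std() == 0.0:
        # Correlation is undefined here; handle the case separately instead
        # of letting a division by zero produce NaN.
        return constant_default if np.allclose(a, b) else 0.0
    return float(np.corrcoef(a, b)[0, 1])

print(pearson_similarity([2, 4, 4, 4, 4], [5, 4, 4, 4, 4]))  # -1.0 (correlation is defined)
print(pearson_similarity([5, 5, 5, 5, 5], [5, 5, 5, 5, 5]))  # 1.0 by the chosen convention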
I would recommend that you also study other measures of similarity, such as Euclidean distance, cosine similarity and adjusted cosine similarity, and make an informed decision about which suits your use case best. It may well end up being a hybrid approach.

Pearson correlation divides by the standard deviations of the variables, which in your case are zero, therefore causing a division-by-zero error. It is considered a good choice because essentially no real data set has a standard deviation of zero. In other words, completely uniform data sets are outside the domain of the Pearson correlation coefficient, but that is no reason not to use it.

Related

How can I simulate a lognormal distribution without knowing the mean and standard deviation?

Consider the Lucas endowment economy with inflation. We know that consumption growth and inflation are log-normally distributed, and that consumption growth and inflation are uncorrelated through time and with each other.
How can I compute the one-period nominal risk-free rate (1 + i_{t,t+1})?
I have to solve this problem in MATLAB and tried using lognrnd():
g_t1 = lognrnd(mu_c, sg_c1)
g_t2 = lognrnd(mu_c, sg_c2)
pi_t1 = lognrnd(mu_pi, sg_pi)
pi_t2 = lognrnd(mu_pi, sg_pi)
but I don't know how to go on without any values. How can I then assign the distribution values to a vector or matrix?
You can’t generate simulated values from a distribution without providing a concrete parameterization for that distribution.
If you can’t use theory to determine the parameter values but you have access to observational data, you can estimate the parameter values. Alternatively, you can use subject matter opinions for your problem context, or WAGs (Wild-Assed Guesses). In all of these cases, be aware that the true parameter values almost certainly differ from the values you are using. Consequently, I recommend using design of experiments over plausible ranges of the values, and fitting a response surface model to determine how sensitive your simulation’s results are to variation in the input distribution’s parameters.
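As a rough sketch of that workflow (in Python/NumPy rather than MATLAB, with entirely made-up data and parameter grids), you would first estimate mu and sigma from observations if you have them, and otherwise sweep a grid of plausible values to see how sensitive the simulated quantity is:

import numpy as np

rng = np.random.default_rng(0)

# If you have observational data, estimate mu and sigma of the underlying
# normal from its logarithm.  (The "observed" array here is made up purely
# for illustration.)
observed = rng.lognormal(mean=0.02, sigma=0.01, size=500)
mu_hat = np.log(observed).mean()
sg_hat = np.log(observed).std(ddof=1)
print(f"estimated mu = {mu_hat:.4f}, sigma = {sg_hat:.4f}")

# Otherwise, sweep plausible ranges of the parameters (the grid below is a
# guess, not a recommendation for this model) and see how much the simulated
# quantity of interest moves across the design.
for mu_c in (0.01, 0.02, 0.03):
    for sg_c in (0.005, 0.01, 0.02):
        g = rng.lognormal(mean=mu_c, sigma=sg_c, size=10_000)  # consumption growth draws
        print(f"mu={mu_c:.3f} sigma={sg_c:.3f} -> mean gross growth {g.mean():.4f}")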

How to compare two hyperparameters in a hierarchical model?

In one hierarchical model, we have two hyperparameters: dnorm(A_mu, 0.25^-2) and dnorm(B_mu, 0.25^-2). In this case 0.25 is the sd, and I use it as a fixed number. A_mu and B_mu represent the group-level means. After fitting the data with rjags, we get the posterior distributions for each parameter. Do I just directly compare the highest posterior density intervals (HDIs) of A_mu and B_mu? Do I need to calculate something using the sd (0.25)?
In another case, the sd of the two hyperparameters is not fixed, e.g. dnorm(A_mu, A_sd) and dnorm(B_mu, B_sd). How can I compare the two hyperparameters and make a decision, e.g. that this group is significantly different from the other group?
Remember that you are getting posterior distributions for A_mu and B_mu. This makes your comparison easy: you can look at the 95% credible intervals for the parameters (or pick whatever level satisfies your needs). I believe JAGS uses Gibbs sampling, so you should be able to get the raw samples from the posteriors of A_mu and B_mu. You can then ask "what is the probability that B_mu is greater than some value?" by calculating the percentage of posterior samples that exceed that value. Alternatively, and in a similar spirit to frequentist hypothesis testing, you can ask what the probability is that the mean of B_mu is a draw from the posterior of A_mu.

So the key is just to work directly with the samples from your posterior. I would recommend Andrew Gelman's BDA3 textbook (Chapter 4) as a really good reference on these concepts.
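As a sketch of that first comparison (written in Python/NumPy for brevity even though the model was fitted with rjags; the arrays below are hypothetical stand-ins for the chains you would extract via coda), the probability that B_mu exceeds A_mu is just the fraction of paired posterior draws where it does:

import numpy as np

# Hypothetical posterior draws standing in for the chains extracted from the
# rjags/coda output; in practice, use the actual paired MCMC samples.
rng = np.random.default_rng(1)
A_mu_samples = rng.normal(0.30, 0.05, size=20_000)
B_mu_samples = rng.normal(0.42, 0.05, size=20_000)

# Probability that B_mu exceeds A_mu, computed from paired draws (pairing the
# draws preserves any posterior correlation between the two parameters).
p_b_greater = np.mean(B_mu_samples > A_mu_samples)

# 95% central interval for the difference B_mu - A_mu.
diff = B_mu_samples - A_mu_samples
lo, hi = np.percentile(diff, [2.5, 97.5])

print(f"P(B_mu > A_mu) = {p_b_greater:.3f}")
print(f"95% interval for B_mu - A_mu: ({lo:.3f}, {hi:.3f})")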
A few things to keep in mind before drawing conclusions from the data: (1) always check the validity of your Markov chains by evaluating things like autocorrelation; (2) try a posterior predictive check to make sure your model fits the data well. If your model fits the data poorly, the procedure above can give very misleading results.

Multiclass classification or regression?

I am trying to train a CNN model to classify images based on their aesthetic score. There are 200,000 images, and every image is rated by more than 100 subjects. The mean score is calculated and the scores are normalized.
The distribution of the scores is approximately Gaussian, so I have decided to build a 10-class classification model, assigning an appropriate weight to each class since the data is imbalanced.
My question:
For this problem the scores are continuous, i.e. 0 < 0.2 < 0.3 < 0.4 < 0.5 < ... < 1.
Does that mean this is a regression problem? If so, how do I balance the data for a regression problem, given that most of the data points lie between 0.4 and 0.6?
Thanks!
Since your labels are continuous, you could divide them into 10 equal-frequency quantile bins using something like pandas.qcut() and assign a class label to each bin. This turns the regression problem into a classification problem.
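For instance (a sketch; the column name is made up and pandas is assumed):

import pandas as pd

# Hypothetical frame with one row per image and its normalized mean score.
df = pd.DataFrame({"mean_score": [0.12, 0.45, 0.47, 0.51, 0.55,
                                  0.61, 0.83, 0.49, 0.52, 0.58]})

# qcut puts (roughly) the same number of images into each of the 10 bins;
# labels=False returns integer class codes 0..9.
df["score_class"] = pd.qcut(df["mean_score"], q=10, labels=False, duplicates="drop")
print(df)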
As far as the imbalance is concerned, you may want to try oversampling the minority classes. This helps ensure your model is not biased towards the majority classes.
Hope this helps.
I would recommend doing a histogram equalization over ALL of your participants' data first, so that their ratings are distributed equally.
Then, for each image in your training set, calculate the expected value (and, if you also want to, the variance). The expected value is just the mean of the votes; for the variance there are standard functions in (almost) every programming language that take an array of votes and return the variance.
Now take the expected value (and, if you want, also the variance) as the ground truth for your network.
EDIT: Histogram equalization:
Histogram equalization is a method for using a given numerical range as efficiently as possible.
In the context of images, it changes the pixel values so that the darkest pixel becomes 0 and the lightest becomes 255, and every grayscale value is redistributed so that (on average) each occurs as often as any other. You want the same for your dataset, except that your values run from 0 to 10 rather than 0 to 255, and you don't need to (and shouldn't) round the resulting values to integers. In this way, frequently occurring votes are spread out and rarely occurring votes are pulled together.
You should probably first calculate the expected value per image and then do the histogram equalization over the expected values of all images.
This way the CNN should be better able to differentiate those small differences.
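Here is one way to read that suggestion as code (a sketch, not a standard recipe: rank-based equalization of the per-image expected values, with made-up scores):

import numpy as np

# Hypothetical per-image mean scores, clustered around the middle of the scale.
scores = np.array([4.8, 5.0, 5.1, 5.2, 5.3, 5.5, 5.6, 6.0, 7.5, 9.0])

# Rank-based "histogram equalization": map each score to its empirical
# quantile, then rescale to the original 0-10 range.  Ties would need extra
# care; argsort-of-argsort gives plain ranks for distinct values.
ranks = scores.argsort().argsort().astype(float)
equalized = ranks / (len(scores) - 1) * 10.0

print(np.c_[scores, equalized])  # clustered scores are spread out evenly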

KNN classification with categorical data

I'm busy working on a project involving k-nearest neighbor (KNN) classification. I have mixed numerical and categorical fields. The categorical values are ordinal (e.g. bank name, account type). Numerical types are, e.g., salary and age. There are also some binary types (e.g. male, female).
How do I go about incorporating categorical values into the KNN analysis?
As far as I'm aware, one cannot simply map each categorical field to numeric keys (e.g. bank 1 = 1, bank 2 = 2, etc.), so I need a better approach for using the categorical fields. I have heard that one can use binary indicator variables instead. Is this a feasible method?
You need to find a distance function that works for your data. The use of binary indicator variables solves this problem implicitly, which has the benefit of letting you keep your (probably matrix-based) implementation with this kind of data. A much simpler way, appropriate for most distance-based methods, is to just use a modified distance function.
There is an infinite number of such combinations. You will need to experiment to find what works best for you. Essentially, you might use some classic metric on the numeric values (usually with normalization applied, though it may make sense to move that normalization into the distance function as well), plus a distance on the other attributes, scaled appropriately.
In most real application domains of distance-based algorithms, this is the hardest part: optimizing your domain-specific distance function. You can see it as part of preprocessing: defining similarity.
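As a toy illustration of such a hand-built distance (a Gower-style mix of a range-normalized numeric part and a 0/1 mismatch count on the categorical part; the weights and feature ranges below are placeholders you would tune for your domain):

import numpy as np

def mixed_distance(x_num, x_cat, y_num, y_cat, num_ranges, w_num=1.0, w_cat=1.0):
    """Combine a normalized numeric distance with a categorical mismatch count.

    `num_ranges` holds the spread (max - min) of each numeric feature so the
    numeric part is scale-free; `w_num` / `w_cat` are tuning weights with no
    special justification here.
    """
    x_num, y_num = np.asarray(x_num, float), np.asarray(y_num, float)
    num_part = np.sqrt((((x_num - y_num) / np.asarray(num_ranges, float)) ** 2).sum())
    cat_part = sum(a != b for a, b in zip(x_cat, y_cat))  # simple 0/1 mismatch per field
    return w_num * num_part + w_cat * cat_part

# Example: (salary, age) plus (bank, account type, gender).
d = mixed_distance([52_000, 31], ["bank A", "cheque", "F"],
                   [48_000, 45], ["bank B", "cheque", "F"],
                   num_ranges=[100_000, 60])
print(d)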
There is much more than just Euclidean distance. There are various set-theoretic measures which may be much more appropriate in your case, for example the Tanimoto coefficient, Jaccard similarity, or Dice's coefficient. Cosine might be an option, too.
There are whole conferences dedicated to the topics of similarity search - nobody claimed this is trivial in anything but Euclidean vector spaces (and actually, not even there): http://www.sisap.org/2012
The most straightforward way to convert categorical data into numeric form is to use indicator vectors. See the reference I posted in my previous comment.
Can we use locality-sensitive hashing (LSH) + edit distance and assume that every bin represents a different category? I understand that categorical data does not show any order, and the bins in LSH are arranged according to a hash function. Finding a hash function that gives a meaningful number of bins sounds to me like learning a metric space.

How to find the "optimal" cut-off point (threshold)

I have a set of weighted features for machine learning. I'd like to reduce the feature set and just use those with a very large or very small weight.
So, in a plot of the sorted weights, I'd only like to use the features whose weights lie above the upper or below the lower of two thresholds (the yellow lines in my plot).
What I'm looking for is some kind of slope-change detection, so that I can discard all the features up to the first (and after the last) increase/decrease in the slope coefficient.
While I (think I) know how to code this myself (using first and second numerical derivatives), I'm interested in any established methods. Perhaps there is some statistic or index that computes something like this, or something I can use from SciPy?
Edit:
At the moment I'm using 1.8*positive.std() as the positive threshold and 1.8*negative.std() as the negative threshold (fast and simple), but I'm not mathematician enough to determine how robust this is. I don't think it is, though. ⍨
If the data are (approximately) Gaussian distributed, then just using a multiple of the standard deviation is sensible. If you are worried about heavier tails, you may want to base your analysis on order statistics instead.
Since you've plotted the weights, I'll assume you're willing to sort all of the data. Let N be the number of data points in your sample, and let x[i] be the i-th value in the sorted list. Then 0.5*(x[int(0.8413*N)] - x[int(0.1587*N)]) is an estimate of the standard deviation that is more robust against outliers, and it can be used exactly as you indicated above. (The magic numbers 0.8413 and 0.1587 are the fractions of the data that lie below mean+1*sigma and mean-1*sigma, respectively, for a Gaussian.)
There are also situations where just keeping the highest 10% and lowest 10% would be sensible, and those cutoffs are easily computed once you have the sorted data on hand.
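A small sketch of that order-statistic estimate and the resulting cut-offs (NumPy; the 1.8 multiplier is simply carried over from the question, not a recommendation):

import numpy as np

def robust_sigma(weights):
    """Estimate sigma from the 15.87% and 84.13% order statistics."""
    x = np.sort(np.asarray(weights, dtype=float))
    n = len(x)
    return 0.5 * (x[int(0.8413 * n)] - x[int(0.1587 * n)])

# Hypothetical feature weights standing in for the real ones.
rng = np.random.default_rng(2)
weights = rng.normal(0.0, 0.1, size=1_000)

sigma = robust_sigma(weights)
upper, lower = 1.8 * sigma, -1.8 * sigma  # the multiplier from the question
kept = weights[(weights > upper) | (weights < lower)]
print(f"robust sigma ~ {sigma:.3f}, keeping {len(kept)} of {len(weights)} features")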
These are somewhat ad hoc approaches based on the content of your question. What you're trying to do is, in general terms, (a form of) anomaly detection, and you can probably do a better job of it if you are careful in defining/estimating the shape of the distribution near the middle, so that you can tell when the features are getting anomalous.