Prioritize a specific feature while embedding for a recommendation system - feature-selection

I am trying to build a content-based movie recommendation system.
Say the fields I am using are Director, Crew, and Genre.
The procedure is as follows:
Concatenate all the features as 'Director' + ' ' + 'Crew' + ' ' + 'Genre'.
Apply a CountVectorizer to this concatenated feature.
Calculate the similarity scores.
Order them according to the similarity score.
Here all the features have the same priority.
Now I want to prioritize 'Director': I want the recommended movies to tend to have the same Director.
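Since CountVectorizer works on raw token counts, one common trick is to repeat the prioritized field so its tokens count more toward the similarity. A minimal sketch, assuming a toy DataFrame and an arbitrary weight of 3 to tune (not a prescription):

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    df = pd.DataFrame({
        "Director": ["nolan", "nolan", "spielberg"],
        "Crew": ["zimmer pfister", "zimmer", "williams kaminski"],
        "Genre": ["thriller", "scifi", "drama"],
    })

    director_weight = 3  # assumed; raise it to push recommendations toward the same Director
    df["soup"] = ((df["Director"] + " ") * director_weight
                  + df["Crew"] + " " + df["Genre"])

    vectors = CountVectorizer().fit_transform(df["soup"])
    similarity = cosine_similarity(vectors)  # similarity[i, j] between movies i and j

Alternatively, vectorize each field separately and take a weighted sum of the per-field similarity matrices; that makes the priorities explicit and independent of token counts.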

Related

Building propensity score for a cluster

I am working on an exercise to build an influencer score for each user in my data set. What that means is that a user with higher engagement should get a higher score and vice versa. However, I have many different types of engagement variables and I am not sure which one should be weighted higher.
So, I first did a cluster analysis to divide users into different groups based on engagement activity, using 5 different types of engagement. Based on this, I found that one of the clusters has a high level of engagement across all the different engagement variables. This is the group I am interested in. However, it is possible that the group size I get may be smaller than the number of users I want to use in the future. So, I now want to use these clusters and create a propensity score.
E.g., in the cluster analysis, say I get 5 clusters c1, c2, c3, c4, c5, and c5 is my cluster of interest. So, I give all users in c5 a value of 1 (= influencer) and I give all users in c1 to c4 a value of 0 (= not influencer). Now, I use this binary variable and build a logistic regression model (using the same engagement variables as used for clustering) to get a propensity for everyone to be an influencer. This way, I can change the threshold to reduce or increase the number of users I want to select.
Now, the issue I am running into is that one of the engagement variables is able to predict influencer very well, and hence my propensity scores are very close to either 1 or 0, which defeats the purpose of why I wanted the propensity score in the first place.
So, 2 questions -
1) Is this approach of building an unsupervised classification and then using it to build a supervised classification a sound approach for what I am trying to do?
2) How do I reduce the contribution of the variable that predicts influencer really well, to ensure that I get a much smoother curve instead of values near 0 or 1? I don't want to remove this variable from the model, as it is important from a business perspective.
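For concreteness, a rough sketch of the pipeline described above, with one way to temper the dominant variable without removing it: L2 regularization shrinks its coefficient, which smooths the propensities. The synthetic data and the choice of cluster are assumptions.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.gamma(2.0, 1.0, size=(1000, 5))        # stand-in for the 5 engagement variables
    X = StandardScaler().fit_transform(X)          # scale so regularization hits all variables evenly

    labels = KMeans(n_clusters=5, random_state=0).fit_predict(X)
    influencer_cluster = 4                         # assumed: whichever cluster shows high engagement
    y = (labels == influencer_cluster).astype(int) # 1 = influencer, 0 = not

    model = LogisticRegression(C=0.1).fit(X, y)    # smaller C = stronger shrinkage = smoother scores
    propensity = model.predict_proba(X)[:, 1]      # threshold this to select as many users as needed

Note that if the clusters are well separated, near-0/1 propensities are expected, since the model is simply recovering the cluster boundary; regularization only softens this.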

Analyse abstract data

I need to process a lot of CSV files that contain 3 columns: date, TV channel id, movie id.
Based on those columns, I need to classify the genre of each movie and the genre of each TV channel.
I'm new to big data processing and I was wondering how I can classify that data if I only have an id (I cannot use another source to look up the id, or generate random data to train my algorithm).
The solution I found is to define some ranges of hours and put the films that fall within a range into some genre. Example:
movies that are played between 01:00-04:00, genre 1;
movies that are played between 04:01-06:00, genre 2;
etc.
After classifying the movies, I can classify the TV channels based on the movies they have played.
And I'm planning to do it using Spark :)
Does anyone have another solution or any advice? It's kind of hard because the data looks so abstract.
Thank you
When you say "I need to classify the genre of the movie", do you mean "Drama", "Comedy", "Action", or "Genre1", "Genre2"? I will assume the second case in what follows.
Don't assign a genre by hand - Use a clustering algorithm
First, I would not assign a genre based only on the time when the movie is played. Generally speaking, I would advise you against doing the clustering by hand, as this is what clustering algorithms are made for. They use features to group individuals that are, in a way, related to each other.
In your case, there is a tricky part: each data point/row is not a movie but a single broadcast of one. Thus, a movie might be present in different clusters, meaning it has different genres.
There are several options:
Either a movie belongs to different genres - which is quite natural.
Or you choose only one genre, based on the group in which the movie appears most frequently.
If you decide to assign multiple genres per movie, you might think of a threshold: for instance, if a movie appears fewer than N times in a group, then it does not belong to that group (unless it is the only group it appears in).
Create new features
You should design as many new features* as you can, helping the clustering algorithm to separate the data well and create homogeneous clusters.
Off the top of my head, you could:
Add a boolean feature for each time frame you consider (0:00 - 3:59; 4:00 - 6:00; ...). Only the feature for the time frame in which the movie is played is one; the others are zero.
Add a feature counting how many times the movie has been played (Men in Black is played more often than 12 Angry Men).
Add a feature counting how many channel IDs have played this movie (Star Wars is played on more channels than some Bollywood movie).
...
Think of how a genre is represented/played throughout all channels and create the features accordingly.
PS: * Don't get me wrong: "as many features as you can" means more than your three current features, but beware of what's called the curse of dimensionality.
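To make those feature ideas concrete, here is a hedged sketch in pandas/scikit-learn (the asker plans to use Spark, where the same aggregations exist). The column names date/channel_id/movie_id follow the question; the file name, time-frame bins, and cluster count are assumptions.

    import pandas as pd
    from sklearn.cluster import KMeans

    logs = pd.read_csv("broadcasts.csv", parse_dates=["date"])  # hypothetical file

    # Counts per time frame generalize the boolean time-frame features above
    hour = logs["date"].dt.hour
    logs["time_frame"] = pd.cut(hour, bins=[0, 4, 6, 12, 18, 24], right=False,
                                labels=["night", "early", "morning", "day", "evening"])

    features = pd.concat(
        [
            pd.crosstab(logs["movie_id"], logs["time_frame"]),   # plays per time frame
            logs.groupby("movie_id").size().rename("n_plays"),   # how often it is played
            logs.groupby("movie_id")["channel_id"].nunique()
                .rename("n_channels"),                           # how many channels play it
        ],
        axis=1,
    )

    movie_genre = KMeans(n_clusters=5, random_state=0).fit_predict(features)  # genre = cluster id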

How to calculate the mean of a dataframe column and find the top 10%

I am very new to Scala and Spark, and am working on some self-made exercises using baseball statistics. I am using a case class to create an RDD and assign a schema to the data, and am then turning it into a DataFrame so I can use Spark SQL to select groups of players, via their stats, that meet certain criteria.
Once I have the subset of players I am interested in looking at further, I would like to find the mean of a column, e.g. batting average or RBIs. From there I would like to break all the players into percentile groups based on their performance relative to all players: the top 10%, the bottom 10%, 40-50%, and so on.
I've been able to use the DataFrame.describe() function to return a summary of a desired column (mean, stddev, count, min, and max), though all as strings. Is there a better way to get just the mean and stddev as Doubles, and what is the best way of breaking the players into groups of 10 percentiles?
So far my thought is to find the values that bookend the percentile ranges and write a function that groups players via comparators, but that feels like it is bordering on reinventing the wheel.
I was able to get the percentiles by using window functions, applying ntile() and cumeDist() over the window. ntile() creates groupings based on an input number: if you want things grouped by 10%, just use ntile(10); if by 5%, then ntile(20). For a more fine-grained result, cumeDist() applied over the window will output a new column with the cumulative distribution, and those can be filtered from there through select(), where(), or a SQL query.
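For reference, a minimal sketch of this approach in PySpark (the question uses Scala, but the API is analogous; the DataFrame and column names are made up):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    players = spark.createDataFrame(
        [("ruth", 0.342), ("mays", 0.301), ("bonds", 0.298)],
        ["player", "batting_avg"],
    )

    # Mean and stddev as Doubles, rather than describe()'s strings:
    stats = players.select(F.mean("batting_avg").alias("mu"),
                           F.stddev("batting_avg").alias("sigma")).first()
    mu, sigma = stats["mu"], stats["sigma"]

    # Decile groups via ntile(10) over a window ordered by the stat:
    w = Window.orderBy(F.desc("batting_avg"))
    ranked = (players
              .withColumn("decile", F.ntile(10).over(w))
              .withColumn("cume_dist", F.cume_dist().over(w)))
    top_10_percent = ranked.where(F.col("decile") == 1)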

Clustering/Nearest Neighbor

I have thousands to tens of thousands of data points (x, y) coming from 5 to 6 different sources. I need to uniquely group them based on certain distance criteria, in such a way that each group contains exactly one input from each source and all points in a group are within a certain distance d of each other. The groups formed should be the best possible match.
Is this a combination of clustering and nearest neighbor?
What are the recommendations for algorithms?
Are there any open source implementations available for it?
I see many references to KD-tree implementations, k-means clustering, etc. I am not sure how I can tailor them to this specific need.
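For instance, a KD-tree gets me candidate sets cheaply, but not the one-input-per-source constraint. A sketch of what I have in mind, on synthetic data with only two sources shown (the scipy calls are real; everything else is assumed):

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    source_a = rng.uniform(0, 100, size=(1000, 2))  # (x, y) points from source A
    source_b = rng.uniform(0, 100, size=(1000, 2))  # (x, y) points from source B
    d = 1.5                                         # distance criterion

    tree = cKDTree(source_b)
    # candidates[i] lists indices of B-points within d of A-point i; turning these
    # into one-per-source groups is an assignment problem (e.g. minimum-cost
    # matching via scipy.optimize.linear_sum_assignment on pairwise distances).
    candidates = tree.query_ball_point(source_a, r=d)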

Mahout Log Likelihood similarity metric behaviour

The problem I'm trying to solve is finding the right similarity metric, rescorer heuristic, and filtration level for my data. (I'm using 'filtration level' to mean the number of ratings that a user or item must have associated with it to make it into the production database.)
Setup
I'm using Mahout's Taste collaborative filtering framework. My data comes in the form of triplets where an item's ratings are contained in the set {1, 2, 3, 4, 5}. I'm using an item-based recommender atop a log-likelihood similarity metric. I filter users who rate fewer than 20 items out of the production dataset. RMSE looks good (around 1.17) and there is no data capping going on, but there is an odd behaviour that is undesirable and borders on error-like.
Question
First Call -- Generate a 'top items' list with no info from the user. To do this I use, what I call, a centered sum:

    for item in items:
        for r in item.ratings:
            score[item] += r - center

where center = (5 + 1) / 2 = 3, if you allow ratings on a scale of 1 to 5, for example.
I use a centered sum instead of average ratings to generate the top items list, mainly because I want the number of ratings that an item has received to factor into the ranking.
Second Call -- I ask for 9 similar items for each of the top items returned in the first call. For each top item I ask about, 7 out of the 9 similar items returned are the same (as the similar items returned for the other top items)!
Is it about time to try some rescoring? Maybe multiplying the similarity of two games by (number of co-rated items)/x, where x is tuned (around 50 or so to begin with).
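In code, the rescoring formula I have in mind is just the following (a sketch of the formula only, not of Mahout's Java rescorer interface; x is the tunable constant mentioned above):

    def rescore(similarity, num_co_rated, x=50.0):
        # Damp similarities that are backed by few co-rated items;
        # optionally cap the factor at 1.0 so heavily co-rated pairs stay unchanged.
        return similarity * (num_co_rated / x)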
Thanks in advance fellas
You are asking for 50 items similar to some item X. Then you look for 9 similar items for each of those 50, and most of them are the same. Why is that surprising? Similar items ought to be similar to the same other items.
What's a "centered" sum? Ranking by sum rather than by average still gives you relatively similar output if the number of ratings in each sum is roughly similar.
What problem are you trying to solve? None of this seems to have a bearing on the recommender system you describe, which you say works. Log-likelihood similarity is not even based on ratings.
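To make that last point concrete: log-likelihood similarity is computed from co-occurrence counts in a 2x2 contingency table, not from the rating values themselves. A sketch of the G-test statistic it is built on (this is the standard Dunning formula; Mahout's exact implementation may differ in details):

    import math

    def log_likelihood_ratio(k11, k12, k21, k22):
        # k11: users who rated both items, k12: item A only,
        # k21: item B only, k22: neither. Rating *values* never appear.
        n = k11 + k12 + k21 + k22
        row1, row2 = k11 + k12, k21 + k22
        col1, col2 = k11 + k21, k12 + k22

        def term(observed, row, col):
            if observed == 0:
                return 0.0
            expected = row * col / n
            return observed * math.log(observed / expected)

        return 2.0 * (term(k11, row1, col1) + term(k12, row1, col2)
                      + term(k21, row2, col1) + term(k22, row2, col2))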