Analyse abstract data - Scala

I need to process a lot of CSV files that contain 3 columns: date, TV channel ID, movie ID.
Based on those columns, I need to classify the genre of each movie and the genre of each TV channel.
I'm new to big data processing, and I was wondering how I can classify that data when I only have an ID (I cannot use another source to look up the ID, or generate random data to train my algorithm).
The solution that I found is to define some ranges of hours and put the films broadcast within a given range into a given genre. Example:
movies that are played between 01:00-04:00, genre 1;
movies that are played between 04:01-06:00, genre 2;
etc.
After classifying the movies, I can classify the TV channels based on the movies they have played.
And I'm planning to do it using Spark :)
Does anyone have another solution or any advice? It's kind of hard because the data looks so abstract.
Thank you

When you say "I need to classify the genre of the movie", do you mean "Drama", "Comedy", "Action", or "Genre1", "Genre2"? I will assume the second case in what follows.
Don't assign a genre by hand - Use a clustering algorithm
First, I would not assign a genre based only on the time when the movie is played. More generally, I would advise against doing the clustering by hand, as this is what clustering algorithms are made for. They use features to group individuals that are, in some way, related to each other.
In your case, there is a tricky part: each data point/row is a broadcast, not a movie. Thus, the same movie might be present in different clusters, meaning it has different genres.
There are several options:
Either a movie belongs to several genres, which is quite natural.
Or you can choose only one genre, based on the cluster in which the movie appears most frequently.
If you decide to assign multiple genres per movie, you might think of a threshold: for instance, if a movie appears fewer than N times in a cluster, then it does not belong to that cluster (unless it is the only cluster it appears in). A small sketch of this rule follows.
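Sketched in plain Scala, assuming you already have one (movieId, clusterId) pair per broadcast; the function name, the threshold parameter, and the fallback behaviour are my own choices, not a fixed recipe:

```scala
// One (movieId, clusterId) pair per broadcast; minCount is the threshold N
def genresFor(assignments: Seq[(Int, Int)], movieId: Int,
              minCount: Int): Set[Int] = {
  // Count how often this movie lands in each cluster
  val counts = assignments.filter(_._1 == movieId)
    .groupBy(_._2).map { case (cluster, xs) => cluster -> xs.size }
  val kept = counts.collect { case (cluster, c) if c >= minCount => cluster }.toSet
  // If nothing survives the threshold, keep the cluster(s) it does appear in
  if (kept.nonEmpty) kept else counts.keySet
}
```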
Create new features
You should design as many new features* as you can to help the clustering algorithm separate the data well and create homogeneous clusters.
Off the top of my head, you could add:
A boolean feature for each time frame you consider (0:00-3:59; 4:00-6:00; ...). Exactly one of these features is 1, for the time frame in which the movie is played; the others are 0.
A feature counting how many times the movie has been played (Men in Black is played more often than 12 Angry Men).
A feature counting how many channel IDs have played this movie (Star Wars is played on more channels than some Bollywood movie).
...
Think of how a genre is represented/played across all channels and create the features accordingly.
PS: * Don't get me wrong: "as many features as you can" means more than your current three features, but beware of what's called the curse of dimensionality.
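Since you mention Spark, here is a minimal sketch of this whole suggestion in Spark (Scala). The column names, the time-frame boundaries, and k = 5 clusters are all assumptions you would adapt:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("MovieGenres").getOrCreate()
import spark.implicits._

// Assumed CSV layout: date (parseable as timestamp), channelId, movieId
val plays = spark.read
  .option("header", "true").option("inferSchema", "true")
  .csv("plays/*.csv")
  .withColumn("hour", hour($"date"))
  // One-hot time-frame features: exactly one of them is 1 per broadcast
  .withColumn("night",        when($"hour".between(1, 3),  1.0).otherwise(0.0))
  .withColumn("earlyMorning", when($"hour".between(4, 6),  1.0).otherwise(0.0))
  .withColumn("daytime",      when($"hour".between(7, 18), 1.0).otherwise(0.0))
  .withColumn("evening",
    when($"hour".between(19, 23) || $"hour" === 0, 1.0).otherwise(0.0))

// Per-movie features: how often it is played, and on how many channels
val movieStats = plays.groupBy($"movieId").agg(
  count(lit(1)).cast("double").as("timesPlayed"),
  countDistinct($"channelId").cast("double").as("distinctChannels"))

val assembled = new VectorAssembler()
  .setInputCols(Array("night", "earlyMorning", "daytime", "evening",
                      "timesPlayed", "distinctChannels"))
  .setOutputCol("features")
  .transform(plays.join(movieStats, "movieId"))

// k = 5 "genres" is an arbitrary guess; evaluate several values of k
val model = new KMeans().setK(5).setSeed(42L).fit(assembled)
val clustered = model.transform(assembled) // one cluster per broadcast

// One-genre-per-movie variant: the cluster the movie falls into most often
val movieGenre = clustered.groupBy($"movieId", $"prediction").count()
  .withColumn("rank", row_number().over(
    Window.partitionBy($"movieId").orderBy($"count".desc)))
  .filter($"rank" === 1)
  .select($"movieId", $"prediction".as("genre"))
```

Clustering is done per broadcast, matching the point above that a row is not a movie; the last step collapses that to one genre per movie, and the threshold sketch earlier gives the multi-genre variant.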

Related

Unequal input vector lengths for Neural Network

I'm trying to predict whether a player of a video game will stop playing the game (0/1 for not-stopping/stopping) within the next month based on the game data from matches they've had so far.
Each match a player plays generates X data points; however, each player may have played a different number of matches to date, M. Thus, when a player's data is put into one long vector, the length of that vector will be X*M.
I'm very new to how neural networks work, but it is my understanding that each row of the dataset must have the same number of columns. Is this true?
In light of my problem, I've brainstormed three possible solutions that each make a compromise of sorts...
[Possible solution 1: aggregate data]
I have considered aggregating the match data as one solution, so for example instead of "points in match 1, points in match 2, ..., points in match n", just having "average points per match". However, I'm concerned this isn't the best compromise, as averaging certain data would reduce resolution (e.g. getting a lot of points as one character in the game vs. getting a lot of points as another character may be a useful difference).
[Possible solution 2: add zeros for missing data]
Maybe if I have player A who has played 6 matches and player B who has only played 5 matches so far, I could just add zeros to make player B's vector as long as player A's. However, I worry that adding zeros like this will be like adding a ton of noise to my data.
[Possible solution 3: trim data to same size]
I could set a specific number of matches for each player vector to contain, maybe 10 matches for example. So if a player has fewer than 10 matches they'll be dropped from the dataset, and if a player has more than 10 matches, only their first 10 will appear in the dataset. The downside here is that the only players with a true-prediction label of 1 (stopped playing) would be players that played EXACTLY 10 games... but I'm not just interested in predicting that; I obviously want a more general prediction.
How can I train a neural network on vectors of unequal length?
My question was probably too long for people to want to read, but in short, it discussed some possible ways to deal with unequal input vector lengths for a neural network (or potentially any machine learning algorithm that requires equal lengths).
One solution I hadn't thought of, but which was suggested to me in a Quora answer, is to only include players who have played at least 3 (or some other small number of) games, include those games as raw data, and then aggregate the results of games 4 through n.
This is essentially a good compromise between my "Possible solution 1" and "Possible solution 3" in the question details above.
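A small sketch of that hybrid encoding in Scala (plain collections, no library). The Match fields and k = 3 raw matches are hypothetical placeholders; the point is only that the output vector always has the same length:

```scala
// One match worth of data; the fields are hypothetical placeholders
case class Match(points: Double, kills: Double, durationMin: Double)

/** First k matches kept raw, the rest summarized: fixed-length output. */
def encode(matches: Seq[Match], k: Int = 3): Array[Double] = {
  require(matches.length >= k, s"need at least $k matches")
  val (raw, rest) = matches.splitAt(k)
  val rawPart = raw.flatMap(m => Seq(m.points, m.kills, m.durationMin))
  // Aggregate matches k+1..n; all zeros if the player has exactly k matches
  val n = rest.length.max(1)
  val aggPart = Seq(
    rest.length.toDouble,            // how many further matches were played
    rest.map(_.points).sum / n,      // average points in the rest
    rest.map(_.kills).sum / n,       // average kills in the rest
    rest.map(_.durationMin).sum / n) // average duration in the rest
  (rawPart ++ aggPart).toArray       // always 3*k + 4 values
}
```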

Schema for golf tournament scores

I work for a small non-profit organization and we are looking for a way to quickly tally scores and provide a few statistics for donors at our annual golf tournament. I thought this would be fairly easy, but I'm struggling to come up with a database schema to capture the scores. I can't figure out how the player's score relates to the specific hole on the course.
This is the diagram that I have so far. Am I way off base with this?
The schema can be found here: https://app.quickdatabasediagrams.com/#/schema/forneGJp40inm7rWlf2Sbg
Perhaps make the Scores table an m:n join table between Players and Holes to capture each player's score on each hole. This is depicted in the diagram below. To get the score for a round, you'd sum all scores for all holes with a specific CourseId, for a specific event.
I also denormalised it a little, adding a total score to the Rounds table. This means that you don't need to SUM() the individual scores every time you want the tally for a player's round. That's just a suggestion for performance optimisation.
Source: https://app.quickdatabasediagrams.com/#/schema/x_amshIckkeGp8KAKEAmLQ
If it is possible to play the same course twice in the same event (the first and last matches could be on the same course, for example) then you should provide for that.
I have two other suggestions:
I think the relationship between Events and Venues is in the wrong direction.
I suggest splitting Players into two tables: one representing the human being, and the other representing that human's participation in a round. Perhaps "Person" and "Contestant" would be good names.
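To make the shape of this concrete, here is the suggested structure sketched as plain Scala case classes; all names are illustrative, not the actual columns in the linked diagram:

```scala
// Illustrative shapes for the suggested tables
case class Person(personId: Int, name: String)
case class Contestant(contestantId: Int, personId: Int, eventId: Int)
case class Hole(holeId: Int, courseId: Int, number: Int, par: Int)
case class Round(roundId: Int, eventId: Int, courseId: Int,
                 contestantId: Int, totalScore: Int) // denormalised tally
// The m:n join between a contestant's round and each hole they played
case class Score(roundId: Int, holeId: Int, strokes: Int)

// The denormalised Round.totalScore should always equal this sum
def totalFor(roundId: Int, scores: Seq[Score]): Int =
  scores.filter(_.roundId == roundId).map(_.strokes).sum
```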

How to train an ANN to play a card game?

I would like to teach an ANN to play Hearts, but I am stuck on how to actually perform the training.
A friend suggested using Weka for the implementation of the actual ANN, but I've never used it, so I'm leaning towards a custom implementation.
I have programmed the rules and I can let the computer play a game, choosing random but legal cards each turn.
Now I am at a loss as to what to send to the ANN as input and how to extract output (there is a decreasing number of cards each turn, so I can't let each output neuron be a possible card), and how to teach it and when to perform the teaching.
My guess is to give the ANN as input:
The cards that have been played previously, with metadata of which player has played which card
The cards on the table for this turn, also with the same metadata
The cards in the ANN's hand
And then have the output be 13 neurons (the maximum number of cards per player), of which I take the most activated among the cards that are still in the ANN's hand.
I also don't really know when to teach it (after each turn or after each game), as it is beneficial to have all the penalty cards, but bad to have all but one penalty card.
Any and all help is appreciated. I don't really know where else to put this question.
I currently have it programmed in Swift, but it's only 200 lines and I know a few other languages, so I can translate it.
Note that neural networks might not be the best tool to use here. More on that at the end of the answer; I'll answer your questions first.
Now I am at a loss as to what to send to the ANN as input and how to extract output (there is a decreasing number of cards each turn, so I can't let each output neuron be a possible card), and how to teach it and when to perform the teaching.
ANNs require labeled input data. This means pairs (X, y), where X can be whatever (structured) data is related to your problem and y is the list of correct answers you expect the ANN to learn for X.
For example, think about how you would learn math in school. The teacher will do a couple of exercises on the blackboard, and you will write those down. This is your training data.
Then, the teacher will invite you to the blackboard to do one on your own. You might not do so well at first, but he/she will guide you in the right direction. This is the training part.
Then, you'll have to do problems on your own, hopefully having learnt how.
The thing is, even this trivial example is much too complex for an ANN. An ANN usually takes in real-valued numbers and outputs one or more real-valued numbers. So it's actually much dumber than a grade schooler who learns about ax + b = 0 type equations.
For your particular problem, it can be hard to see how it fits this format. As a whole, it doesn't: you can't present the ANN with a game and have it learn the moves; that is much too complex. You need to present it with something for which you have a correct numerical label, and you want the ANN to learn the underlying pattern.
To do this, you should break your problem up into subproblems. For example, input the current player's cards and expect as output the correct move.
The cards that have been played previously, with metadata of which player has played which card
The ANN should only care about the current player. I would not use metadata or any other information that identifies the players.
Giving it a history could get complicated. You might want recurrent neural networks for that.
The cards on the table for this turn, also with the same metadata
Yes, but again, I wouldn't use metadata.
The cards in the ANN's hand
Also good.
Make sure you have as many input units as the MAXIMUM number of cards you want to input (2 x the total number of possible cards, for the cards in hand and those on the table). This will be a binary vector where the i-th position is 1 if the card corresponding to that position is in hand / on the table.
Then do the same for moves: you will have m binary output units, where the i-th is 1 if the ANN thinks you should make move i, with m possible moves in total (pick the maximum if m depends on the stage of the game).
Your training data will also have to be in this format. For simplicity, let's say there can be at most 2 cards in hand and 2 on the table, out of a total of 5 cards, and we can choose from 2 moves (say fold and all in). Then a possible training instance is:
Xi = 1 0 0 1 0 0 0 0 1 1 (meaning cards 1 and 4 in hand, cards 4 and 5 on table)
yi = 0 1 (meaning you should go all in in this case)
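For concreteness, here is one way to build that toy encoding (in Scala, though the idea ports directly to Swift); the sizes are the ones from the example above:

```scala
val totalCards = 5 // toy deck from the example above

/** Binary vector with 1.0 at position i-1 if card i is in the set. */
def oneHot(cards: Set[Int]): Array[Double] =
  Array.tabulate(totalCards)(i => if (cards.contains(i + 1)) 1.0 else 0.0)

// Xi: cards 1 and 4 in hand, cards 4 and 5 on the table
val xi = oneHot(Set(1, 4)) ++ oneHot(Set(4, 5))
// yi: move 2 of 2 ("all in") is the correct move here
val yi = Array(0.0, 1.0)

println(xi.mkString(" ")) // 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0
```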
I also don't really know when to teach it (after each turn or after each game), as it is beneficial to have all the penalty cards, but bad to have all but one penalty card.
You should gather a lot of labeled training data in the format I described, train the network on that, and then use it. You will need thousands or even tens of thousands of games to see good performance. Teaching it after each turn or game is unlikely to do well.
This will lead to very large neural networks. Another thing that you might try is to predict who will win given a current game configuration. This will significantly reduce the number of output units, making learning easier. For example, given the cards currently on the table and in hand, what is the probability that the current player will win? With enough training data, neural networks can attempt to learn these probabilities.
There are obvious shortcomings: the need for a large training data set, and there is no memory of how the game has gone so far (unless you use much more advanced nets).
For games such as these, I suggest you read about reinforcement learning, or dedicated algorithms for your particular game. You're not going to have much luck teaching an ANN to play chess for example, and I doubt you will teaching it to play a card game.
First of all, you need to create a good training data set for the ANN. If your budget allows, you can ask some card professionals to share enough of their matches with you, showing how they played. Another way of generating data could be bots that play cards. Then you need to think about how to represent the data set of played matches to the neural network. I also recommend representing cards not by a single value (0.2, 0.3, 0.4, ..., 0.10, 0.11 for a jack), but as separate inputs, one per card. Also look at elastic neural networks, which can be used for such a task.

Tag Clustering in Lastfm database

I have a last.fm dataset composed of songs and the tags given to them by users. I want to apply clustering to the dataset in order to find clusters of songs based on tags.
The dataset has 200k songs and 119k different tags. I was previously thinking of building an NxM matrix, where N is the number of songs and M is the number of tags, with each position 0 or 1 indicating the presence or absence of a tag on the song. However, the huge dimension of the matrix has stopped me from doing so. I have some ideas about applying SVD to reduce dimensionality before clustering, but I don't know whether that is the best approach.
Therefore, does anybody know of work in the literature that attempts this kind of clustering? Or do you have any other ideas for my problem?
Thank you very much in advance
Clustering probably is the wrong tool for your problem.
Are you sure you want to split your data into (usually) non-overlapping chunks? What if some overlap is needed? Say, there are songs that are both "hip hop" and "driving beats", but these tags are not synonyms?
Frequent itemset mining (Market basket analysis)
is much more applicable, isn't it?
Consider every song to be a "market basket" and every tag to be an "item" in these transactions. FIM will identify frequent tag combinations and derive patterns from them.
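A minimal sketch with Spark's FP-Growth implementation (Scala), assuming a DataFrame with one row per song and an array column of its tags; the toy data and the minSupport/minConfidence values are just placeholders to tune:

```scala
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("TagFIM").getOrCreate()
import spark.implicits._

// Each song is a "basket" of tags; toy rows in place of the last.fm data
val songs = Seq(
  (1, Array("hip hop", "driving beats")),
  (2, Array("hip hop", "driving beats", "90s")),
  (3, Array("rock", "90s"))
).toDF("songId", "tags")

val model = new FPGrowth()
  .setItemsCol("tags")
  .setMinSupport(0.01)   // tune: fraction of songs a tag set must appear in
  .setMinConfidence(0.3) // tune: for the derived association rules
  .fit(songs)

model.freqItemsets.show()     // frequent tag combinations
model.associationRules.show() // patterns like {hip hop} => {driving beats}
```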

How to generate recommendation with matrix factorization

I've read some papers on matrix factorization (latent factor models) in recommender systems, and I can implement the algorithm. I get RMSE results on the MovieLens dataset similar to those reported in the papers.
However, I find that if I try to generate a top-K (e.g. K=10) list of recommended movies for every user by ranking the predicted ratings, the movies predicted to be rated highly are the same for all users.
Is that just how it works, or have I got something wrong?
This is a known problem in recommendation.
It is sometimes called the "Harry Potter" effect: (almost) everybody likes Harry Potter.
So most automated procedures will find out which items are generally popular, and recommend those to the users.
You can either filter out very popular items, or multiply the predicted rating by a factor that is lower the more globally popular an item is.
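As a sketch of the second option in Spark (Scala): train a matrix factorization model (ALS here) as usual, then discount each movie's predicted rating by a function of its global popularity before taking the top-K. The column names and the log-based discount are assumptions, one plausible choice among many:

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("TopKRecs").getOrCreate()
import spark.implicits._

// Assumed MovieLens-style layout: userId, movieId, rating
val ratings = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("ratings.csv")

val model = new ALS()
  .setUserCol("userId").setItemCol("movieId").setRatingCol("rating")
  .setColdStartStrategy("drop")
  .fit(ratings)

// Global popularity = number of users who rated each movie
val popularity = ratings.groupBy($"movieId").agg(count(lit(1)).as("numRatings"))

// Score every (user, movie) pair -- fine for MovieLens, expensive at scale
val candidates = ratings.select($"userId").distinct
  .crossJoin(ratings.select($"movieId").distinct)

// Damp globally popular items: dividing by log(e + numRatings) is an
// arbitrary but monotone discount, so "Harry Potter" items sink in the ranking
val scored = model.transform(candidates)
  .join(popularity, "movieId")
  .withColumn("score", $"prediction" / log(lit(math.E) + $"numRatings"))

val topK = scored
  .withColumn("rank", row_number().over(
    Window.partitionBy($"userId").orderBy($"score".desc)))
  .filter($"rank" <= 10) // popularity-damped top-10 list per user
```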