Constructing a feature vector to learn clusters - cluster-analysis

Example:
Doc {
Citations: {
0: cite0,
1: cite1,
2: cite2,
...
n: citeN
}
}
I am supposed to cluster documents based on the similarity of their citations, but each document has many citations. My confusion is: how do I construct the feature vector for the dataset in this case, so I can feed it into my clustering toolkit?
I am thinking of letting each column be a citation, with the value 1 if the document has that citation.
PS: my background in machine learning is pretty weak - I am reading my lecture notes, but most of them don't touch on this kind of problem. Thank you all in advance!

One simple way of constructing your feature vector is to create an adjacency matrix (say A). The features are binary.
Each row represents a cited document and each column represents a citing document. So, if Document1 is cited by Document3 only, element A(1,3) = 1 and the rest of the elements in that row are 0.
If you are dealing with too many documents, this might not be an efficient representation: with N documents, the matrix is N x N.
If you are writing your own clustering algorithm, make it accept more compact forms (see adjacency lists instead).
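For concreteness, here is a minimal Python sketch (assuming SciPy is available and that each document's citations come as a list of ids; the toy input below is hypothetical) of the binary document-by-citation matrix the question describes. When the cited items are themselves documents in the corpus, this is exactly the adjacency matrix described above.

# Build a sparse binary document-by-citation feature matrix.
from scipy.sparse import lil_matrix

docs = {                          # hypothetical toy input
    "doc1": ["cite0", "cite2"],
    "doc2": ["cite1", "cite2"],
    "doc3": ["cite0"],
}

citations = sorted({c for cites in docs.values() for c in cites})
col = {c: j for j, c in enumerate(citations)}   # citation id -> column index
row = {d: i for i, d in enumerate(docs)}        # document id -> row index

X = lil_matrix((len(docs), len(citations)), dtype=int)
for d, cites in docs.items():
    for c in cites:
        X[row[d], col[c]] = 1     # 1 if the document contains that citation

# X.tocsr() can be fed to most clustering toolkits, e.g. scikit-learn's
# KMeans or AgglomerativeClustering; for binary citation vectors a
# cosine- or Jaccard-style distance is usually a sensible choice.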

Related

Questions about LSH (locality-sensitive hashing) and minhashing implementation

I'm trying to implement this paper
Browser Fingerprint Coding Methods Increasing the Effectiveness of User Identification in the Web Traffic
I have a couple of questions about the LSH algorithm in general and the proposed implementation:
The LSH algorithm is used only when you have a lot of documents to compare with each other (because it is supposed to put the similar ones in the same bucket, from what I understood). If, for example, I have a new document and I want to calculate its similarity with the others, do I have to relaunch the LSH algorithm from scratch, including the new document?
In 'Mining of Massive Datasets', Chapter 3, it is said that for LSH we should use one hash function per band. Each hash function creates n buckets.
So, for the first band, we are going to have n buckets. From the second band onward, am I supposed to keep using the same hash function (so that I keep using the same buckets as before) or another one (ending up with m >> n buckets)?
This question is related to the previous one. If I use the same hash function for all the bands, then I'll have n buckets. No problem here. But if I have to use more hash functions (a different function per band), I'm going to end up with a lot of different buckets. Am I supposed to measure the similarity for each pair in each bucket? (If I only have to use one hash function, then this isn't a problem.)
In the paper, I understood most of the algorithm except for the end.
Basically, two signature matrices are created (one for stable features and one for unstable features) via minhashing. Then, they use LSH on the first matrix to obtain a list of candidate pairs. So far so good.
What happens at the end? Do they perform LSH on the second matrix? How is the result of the first LSH used? I cannot see the relationship between the first and the second LSH.
The output of the final step is supposed to be a list of candidate pairs, right? And all that I have to do is compute the Jaccard similarity on them and set a threshold, right?
Thanks for your answers!
I got a partial answer to my questions (still missing question 4):
1. No. You would keep the bucket structure and hash the new document into it. Then compare it only with the documents in the buckets it fell into.
2. No. You HAVE to use different hash functions and a different set of buckets for each hash function.
3. This is irrelevant because of the answer to (2).
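To make answers (1) and (2) concrete, here is a rough, self-contained Python sketch (not the paper's code; the shingle representation and the numbers of hashes and bands are arbitrary choices) of minhashing plus LSH banding with one bucket table per band, where a new document can be hashed into the existing buckets without re-running everything.

import random
from collections import defaultdict

NUM_HASHES = 20                         # rows in the signature matrix
BANDS = 5                               # b bands of r = NUM_HASHES // BANDS rows
ROWS_PER_BAND = NUM_HASHES // BANDS
PRIME = 2**31 - 1

random.seed(0)
hash_params = [(random.randrange(1, PRIME), random.randrange(PRIME))
               for _ in range(NUM_HASHES)]

def minhash_signature(shingles):
    # shingles: a non-empty set of integer shingle ids for one document
    return [min((a * s + b) % PRIME for s in shingles) for a, b in hash_params]

# One separate bucket table per band, as answer (2) suggests.
band_buckets = [defaultdict(list) for _ in range(BANDS)]

def index_doc(doc_id, shingles):
    sig = minhash_signature(shingles)
    for band in range(BANDS):
        key = tuple(sig[band * ROWS_PER_BAND:(band + 1) * ROWS_PER_BAND])
        band_buckets[band][key].append(doc_id)

def candidates_for(shingles):
    # For a new document, reuse the existing buckets (answer 1) instead of
    # re-running LSH over the whole collection.
    sig = minhash_signature(shingles)
    cands = set()
    for band in range(BANDS):
        key = tuple(sig[band * ROWS_PER_BAND:(band + 1) * ROWS_PER_BAND])
        cands.update(band_buckets[band].get(key, []))
    return cands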

Generate subset of data with known mean

I have a dataset of n observations (an n x 1 vector) and would like to create a subset of this data whose mean is known in advance, by selecting at random only n/3 observations (or within some constraint, i.e., where the mean of the data subset is within a range around the known mean).
Can someone please help me with the code to do this in MATLAB?
Note, I don't want to use the rand function to create random data, as I have already collected my data.
For example on a smaller scale: If I had the following dataset of 12 observations:
data = [8;7;4;6;9;6;4;7;3;2;1;1];
but then wanted to randomly select a subset of this data containing only 4 observations with a mean of 4 (or with a mean between 3.5 and 4.5, for example):
Then the answer might be datasubset = [7;3;2;4], but it could also be datasubset = [6;4;2;4] or datasubset = [6;4;3;4].
It doesn't matter if there are several possible solutions; I just need one of them, though I'd also like to know the alternative solutions.
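The question asks for MATLAB, but the underlying idea is simple rejection sampling: repeatedly draw a random subset without replacement and keep it once its mean falls in the target range. Below is a sketch of that idea in Python (in MATLAB the same loop can be built around randperm(numel(data), k)); the function name and try budget are made up for illustration.

import random
from statistics import mean

data = [8, 7, 4, 6, 9, 6, 4, 7, 3, 2, 1, 1]
k = 4                                   # subset size (n/3 in the original question)
target_low, target_high = 3.5, 4.5

def random_subset_with_mean(data, k, low, high, max_tries=100000):
    for _ in range(max_tries):
        subset = random.sample(data, k)   # sample without replacement
        if low <= mean(subset) <= high:
            return subset
    return None                           # no subset found within the try budget

print(random_subset_with_mean(data, k, target_low, target_high))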

Grouping similar words (bad, worse)

I know there are ways to find synonyms, either by using NLTK/pywordnet or the Pattern package in Python, but they aren't solving my problem.
If there are words like
bad,worst,poor
bag,baggage
lost,lose,misplace
I am not able to capture them. Can anyone suggest a possible way?
There has been a lot of research in this area over the past 20 years. Computers don't understand language, but we can train them to find the similarity or difference between two words with the help of some manual effort.
Possible approaches:
Based on manually curated datasets that encode how words in a language are related to each other.
Based on statistical or probabilistic measures of how words appear in a corpus.
Method 1:
Try WordNet. It is a human-curated network of words which preserves the relationships between words according to human understanding. In short, it is a graph whose nodes are 'synsets' and whose edges are relations between them. Any two words which are close to each other in the graph are close in meaning, and words that fall within the same synset might mean exactly the same thing. 'Bag' and 'baggage' are close, which you can find either by iteratively exploring the graph node by node in a breadth-first style (starting with 'bag' and exploring its neighbours in an attempt to reach 'baggage'; you'll have to limit this search to a small number of iterations for any practical application), or by starting a random walk from one node and trying to reach the other node within some number of tries and some distance. If you reach 'baggage' from 'bag', say, 500 times out of 1000 within 10 moves, you can be pretty sure that they are very similar to each other. Random walks are more helpful in much larger and more complex graphs.
There are many other similar resources online.
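As a small illustration of the WordNet approach, here is a sketch using NLTK's WordNet interface (it assumes nltk is installed and the wordnet corpus has been downloaded); it simply takes the best path similarity over all synset pairs of two words.

# Run nltk.download('wordnet') once if the corpus is not installed yet.
from nltk.corpus import wordnet as wn

def max_path_similarity(word1, word2):
    # Best path similarity over all synset pairs (between 0 and 1, or None
    # when no pair of synsets is connected in the hierarchy).
    best = None
    for s1 in wn.synsets(word1):
        for s2 in wn.synsets(word2):
            sim = s1.path_similarity(s2)
            if sim is not None and (best is None or sim > best):
                best = sim
    return best

print(max_path_similarity('bag', 'baggage'))
print(max_path_similarity('bad', 'poor'))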
Method 2:
Word2Vec. It is hard to explain fully here, but it works by creating a vector (with a user-specified number of dimensions) for each word based on its context in the text. The idea that words appearing in similar contexts mean similar things has been around for decades; e.g., "I'm gonna check out my bags" and "I'm gonna check out my baggage" might both appear in text. You can read the paper for a full explanation (link at the end).
So you can train a Word2Vec model over a large corpus. In the end, you will be able to get a 'vector' for each word. You do not need to understand the significance of this vector itself; you can use this vector representation to find the similarity or difference between words, or to generate synonyms of any word. The idea is that words which are similar to each other have vectors that are close to each other.
Word2vec came out two years ago and immediately became the 'thing to use' in most NLP applications. The quality of this approach depends on the amount and quality of your data. Generally, a Wikipedia dump is considered good training data, as it contains articles about almost everything that makes sense. You can easily find ready-to-use models trained on Wikipedia online.
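As a rough sketch of the training step with gensim (parameter names have changed between gensim versions; recent releases use vector_size where older ones used size, so adjust for your install; the toy corpus below only demonstrates the API, real quality needs a large corpus such as the Wikipedia dump mentioned above):

from gensim.models import Word2Vec

sentences = [                           # toy corpus: a list of tokenized sentences
    ["i", "checked", "my", "bags"],
    ["i", "checked", "my", "baggage"],
    ["the", "service", "was", "bad"],
    ["the", "service", "was", "poor"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=2)
print(model.wv.most_similar("bags", topn=3))   # words with the closest vectors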
A tiny example from Radim's website:
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]
>>> model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
>>> model.similarity('woman', 'man')
0.73723527
The first example gives you the word closest (topn=1) to 'woman' and 'king' while at the same time being furthest from 'man'; the answer is 'queen'. The second example picks the odd one out. The third one tells you how similar the two words are in your corpus.
An easy-to-use tool for Word2vec:
https://radimrehurek.com/gensim/models/word2vec.html
The paper (warning: lots of maths ahead):
http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

In preprocessing data with high cardinality, do you hash first or one-hot-encode first?

Hashing reduces dimensionality while one-hot-encoding essentially blows up the feature space by transforming multi-categorical variables into many binary variables. So it seems like they have opposite effects. My questions are:
What is the benefit of doing both on the same dataset? I read something about capturing interactions but not in detail - can somebody elaborate on this?
Which one comes first and why?
Binary one-hot-encoding is needed for feeding categorical data to linear models and SVMs with the standard kernels.
For example, you might have a feature which is the day of the week. Then you create a one-hot encoding for it:
1000000 Sunday
0100000 Monday
0010000 Tuesday
...
0000001 Saturday
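A quick sketch of that encoding in Python, using pandas.get_dummies (scikit-learn's OneHotEncoder would work just as well for the same day-of-week example):

import pandas as pd

df = pd.DataFrame({"day": ["Sunday", "Monday", "Tuesday", "Saturday"]})
one_hot = pd.get_dummies(df["day"])   # one binary column per distinct day
print(one_hot)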
Feature hashing is mostly used to allow significant storage compression for parameter vectors: you hash the high-dimensional input vectors into a lower-dimensional feature space. The parameter vector of the resulting classifier can then live in the lower-dimensional space instead of in the original input space. This can be used as a method of dimensionality reduction, so you usually expect to trade a small drop in performance for a significant storage benefit.
The example on Wikipedia is a good one. Suppose you have three documents:
John likes to watch movies.
Mary likes movies too.
John also likes football.
Using a bag-of-words model, you first build the document-to-word matrix below (each row is a document, and each entry indicates whether that word appears in the document):

       John likes to watch movies Mary too also football
doc1:    1    1   1    1      1    0   0    0      0
doc2:    0    1   0    0      1    1   1    0      0
doc3:    1    1   0    0      0    0   0    1      1
The problem with this process is that such dictionaries take up a large amount of storage space, and grow in size as the training set grows.
Instead of maintaining a dictionary, a feature vectorizer that uses the hashing trick can build a vector of a pre-defined length by applying a hash function h to the features (e.g., words) in the items under consideration, then using the hash values directly as feature indices and updating the resulting vector at those indices.
Suppose you generate the hashed features below with 3 buckets (apply the hash function to each original feature, i.e., each word, and count how many times the hashed values hit each bucket):
bucket1 bucket2 bucket3
doc1: 3 2 0
doc2: 2 2 0
doc3: 1 0 2
Now you have successfully transformed the features from 9 dimensions to 3 dimensions.
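As a sketch of the same trick in code, scikit-learn's HashingVectorizer hashes tokens straight into a fixed number of buckets (the exact counts will differ from the table above because they depend on the library's hash function, and the alternate_sign/norm options may vary slightly by scikit-learn version):

from sklearn.feature_extraction.text import HashingVectorizer

docs = [
    "John likes to watch movies.",
    "Mary likes movies too.",
    "John also likes football.",
]

vectorizer = HashingVectorizer(n_features=3, alternate_sign=False, norm=None)
X = vectorizer.fit_transform(docs)    # 3 documents x 3 hash buckets
print(X.toarray())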
A more interesting application of feature hashing is to do personalization. The original paper of feature hashing contains a nice example.
Imagine you want to design a spam filter, but one customized to each user. The naive way of doing this is to train a separate classifier for each user, which is infeasible either for training (training and updating the personalized models) or for serving (holding all the classifiers in memory). A smarter way is illustrated below:
Each token is duplicated and one copy is individualized by concatenating each word with a unique user id. (See USER123_NEU and USER123_Votre).
The bag-of-words model now holds the common keywords and also the user-specific keywords.
All words are then hashed into a low-dimensional feature space where the document is trained and classified.
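A small sketch of that personalization trick using scikit-learn's FeatureHasher (the user id and tokens here are made up for illustration): each token is kept once globally and once prefixed with the user id, and everything is hashed into one shared low-dimensional space.

from sklearn.feature_extraction import FeatureHasher

def personalized_tokens(user_id, tokens):
    # keep each token globally and once more prefixed with the user id
    return tokens + [f"{user_id}_{t}" for t in tokens]

hasher = FeatureHasher(n_features=2**10, input_type="string")
docs = [personalized_tokens("USER123", ["cheap", "votre", "offer"])]
X = hasher.transform(docs)            # 1 x 1024 sparse feature vector
print(X.shape)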
Now to answer your questions:
Yes, one-hot encoding should come first, since it transforms a categorical feature into binary features to make it consumable by linear models.
You can certainly apply both to the same dataset, as long as there is a benefit to using the compressed feature space. Note that if you can tolerate the original feature dimension, feature hashing is not required. For example, in a common digit-recognition problem such as MNIST, the image is represented by 28x28 binary pixels. The input dimension is only 784, so feature hashing won't have any benefit in this case.

How to handle new data for recommendation system?

Here's a theoretical question. Let's assume that I have implemented two types of collaborative filtering: user-based CF and item-based CF (in the form of Slope One).
I have a nice data set for these algorithms to run on. But then I want to do two things:
I'd like to add a new rating to the data set.
I'd like to edit an existing rating.
How should my algorithms handle these changes (without doing a lot of unnecessary work)? Can anyone help me with that?
For both cases, the strategy is very similar:
user-based CF:
update all similarities for the affected user (that is, one row and one column in the similarity matrix)
if your neighbors are precomputed, compute the neighbors for the affected user (for a complete update, you may have to recompute all neighbors, but I would stick with the approximate solution)
Slope-One:
update the frequency (only in the 'add' case) and the diff matrix entries for the affected item (again, one row and one column)
Remark: If your 'similarity' is asymmetric, you need to update one row and one column. If it is symmetric, updating one row automatically results in updating the corresponding column.
For Slope One, the matrices are symmetric (frequency) and skew-symmetric (diffs), so here too you only need to update one row or column and you get the other one for free (if your matrix storage works like this); see the sketch below.
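As an illustration (this is a sketch of the idea, not MyMediaLite's code), updating Slope One's frequency and diff structures for a single added or edited rating only touches the entries involving that item:

from collections import defaultdict

freq = defaultdict(lambda: defaultdict(int))    # freq[i][j]: co-rating count
diff = defaultdict(lambda: defaultdict(float))  # diff[i][j]: sum of (r_i - r_j)
user_ratings = defaultdict(dict)                # user_ratings[u][i] = rating

def add_rating(user, item, rating):
    # Update only the row/column of `item` instead of retraining everything.
    for other_item, other_rating in user_ratings[user].items():
        freq[item][other_item] += 1
        freq[other_item][item] += 1
        diff[item][other_item] += rating - other_rating
        diff[other_item][item] += other_rating - rating
    user_ratings[user][item] = rating

def edit_rating(user, item, new_rating):
    # Remove the old rating's contribution, then add the new one.
    old = user_ratings[user].pop(item)
    for other_item, other_rating in user_ratings[user].items():
        freq[item][other_item] -= 1
        freq[other_item][item] -= 1
        diff[item][other_item] -= old - other_rating
        diff[other_item][item] -= other_rating - old
    add_rating(user, item, new_rating)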
If you want to see an example of how this could be implemented, have a look at MyMediaLite (disclaimer: I am the main author): https://github.com/zenogantner/MyMediaLite/blob/master/src/MyMediaLite/RatingPrediction/ItemKNN.cs
The interesting code is in the method RetrainItem(), which is called from AddRatings() and UpdateRatings().
The general concept here is called online algorithms.
Instead of retraining the whole predictor, it can be updated "online" (while remaining usable) with the new data only.
If you google for "online slope one predictor" you should be able to find some relevant approaches from literature.