Create pairs with weighting - matlab

It has been a while since I last programmed, so apologies.
I have a list of pairs, where each pair is a person and that person's affiliated group. That is, X = {'Person1','GroupA';'Person2','GroupB';'Person3','GroupA';...}
Now I would like to create random pairs of the people in this list. This is pretty straightforward. However, I would like the probability of matching two people from the same group to be low, e.g., 10% or x%, and the probability of matching people from different groups to be 90%.
Does someone have an algorithm for this? Preferably in MATLAB, or else in a different programming language?

You can build the list yourself, picking the right percentage of same-group and different-group pairs, and then use a function like randperm to shuffle it and make the list random.
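A minimal sketch of one possible approach, in Python rather than MATLAB. It assumes the 10%/90% figures are meant as relative weights when drawing a partner, not as exact overall probabilities (those also depend on the group sizes). The function name weighted_pairs and its parameters are made up for illustration.

```python
import random

def weighted_pairs(people, same_weight=0.1, diff_weight=0.9, seed=None):
    """people: list of (name, group) tuples. Returns a list of (name, name) pairs.

    Assumption: the weights are relative odds for picking a partner and must
    be positive, otherwise random.choices can fail when only same-group
    candidates are left.
    """
    rng = random.Random(seed)
    pool = list(people)
    rng.shuffle(pool)  # random order, similar in spirit to randperm
    pairs = []
    while len(pool) >= 2:
        name, group = pool.pop()
        # Same-group candidates get the low weight, the others the high weight.
        weights = [same_weight if g == group else diff_weight for _, g in pool]
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        partner, _ = pool.pop(idx)
        pairs.append((name, partner))
    return pairs  # with an odd number of people, one person stays unpaired

X = [("Person1", "GroupA"), ("Person2", "GroupB"),
     ("Person3", "GroupA"), ("Person4", "GroupB")]
pairs = weighted_pairs(X, seed=0)
print(pairs)
```

Each person ends up in at most one pair. If the overall share of same-group pairs has to hit a target exactly, the weights would need tuning against the actual group sizes.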


Methods for searching for people with similar purchasing habits in big data with a given person as the base

I'm looking at finding people whose purchasing behavior is similar to that of a given person or group, as a starting point for a market research problem.
I'm going to represent every person and their habits as a vector and then compare these vectors with that of the base person or group. I'd probably use Faiss. I believe KNN can be used too.
But I'd like to know whether I can use other methods, such as clustering methods like k-means, for such a question, given a person or group as the base. I thought the only way clustering algorithms would work is to first cluster the data, then return the cluster that the 'base person or group' falls into. However, this would be costly and probably not very accurate. But potentially this technique can be used to reduce the search space.
So, do you know of any other ways? (non-Machine Learning or Information Retrieval methods would be welcomed too :) )
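For reference, the vector comparison described above can be sketched as a brute-force nearest-neighbor search. This is only an illustration in plain Python: the vectors, the ids and the function name nearest_neighbors are hypothetical, Euclidean distance is an assumption, and a library such as Faiss would replace the loop for large data. A base group is represented here by the mean of its members' vectors, which is one possible choice among several.

```python
import math

def nearest_neighbors(base, vectors, k=3):
    """Return the k ids whose vectors have the smallest Euclidean distance to base."""
    distances = [(math.dist(base, vec), person_id)
                 for person_id, vec in vectors.items()]
    distances.sort()
    return [person_id for _, person_id in distances[:k]]

# Hypothetical purchase-habit vectors (e.g. share of spend per product category).
people = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.8, 0.2, 0.0],
    "p3": [0.1, 0.1, 0.8],
    "p4": [0.0, 0.9, 0.1],
}

# Base = a single person: search among everyone else.
rest = {pid: vec for pid, vec in people.items() if pid != "p1"}
similar_to_p1 = nearest_neighbors(people["p1"], rest, k=2)

# Base = a group: use the mean vector of its members as the query.
group = ["p1", "p2"]
centroid = [sum(people[m][i] for m in group) / len(group) for i in range(3)]
others = {pid: vec for pid, vec in people.items() if pid not in group}
similar_to_group = nearest_neighbors(centroid, others, k=1)

print(similar_to_p1, similar_to_group)
```

Clustering could then act as a pre-filter: assign the base vector to its nearest cluster centroid and run the search above only over that cluster's members.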

Creating comparable groups based on a number of variables

I have a number of students and I want to divide them into groups. I have measured 5 skills in my students. The goal is to assign students to groups in such a way that all groups have comparable levels of each skill. In other words, I want each of the skills to be distributed comparably across groups, and not concentrated in some groups. What statistical analysis can do this? Preferably in SPSS.
You probably want your groups to have a certain size, too?
This looks more like a resource allocation problem than a clustering problem to me. Think of skills as resources.
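To make the resource allocation view concrete, here is a rough greedy sketch in Python (not SPSS). It assumes equal-size groups and that the number of students divides evenly by the number of groups; the function name balanced_groups, the scores and the "spread" criterion are all assumptions made for illustration. A greedy pass like this gives no optimality guarantee.

```python
def balanced_groups(students, n_groups):
    """students: dict name -> list of skill scores. Returns (groups, totals)."""
    n_skills = len(next(iter(students.values())))
    capacity = -(-len(students) // n_groups)  # ceiling division
    groups = [[] for _ in range(n_groups)]
    totals = [[0.0] * n_skills for _ in range(n_groups)]

    def spread(candidate_totals):
        # Sum over skills of (largest group total - smallest group total).
        return sum(max(col) - min(col) for col in zip(*candidate_totals))

    # Place the strongest students first; they are the hardest to balance later.
    order = sorted(students, key=lambda s: sum(students[s]), reverse=True)
    for name in order:
        best_g, best_spread = None, None
        for g in range(n_groups):
            if len(groups[g]) >= capacity:
                continue  # group is full
            trial = [row[:] for row in totals]
            trial[g] = [t + s for t, s in zip(trial[g], students[name])]
            sp = spread(trial)
            if best_spread is None or sp < best_spread:
                best_g, best_spread = g, sp
        groups[best_g].append(name)
        totals[best_g] = [t + s for t, s in zip(totals[best_g], students[name])]
    return groups, totals

# Hypothetical scores on 5 skills for 6 students.
students = {
    "s1": [9, 2, 5, 7, 3], "s2": [3, 8, 6, 2, 7], "s3": [5, 5, 5, 5, 5],
    "s4": [8, 7, 2, 3, 6], "s5": [2, 3, 9, 8, 4], "s6": [6, 6, 4, 4, 6],
}
groups, totals = balanced_groups(students, n_groups=2)
print(groups)
print(totals)  # per-group sum of each skill, to compare by eye
```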

Grouping similar words (bad, worse)

I know there are ways to find synonyms, either by using NLTK/pywordnet or the Pattern package in Python, but that isn't solving my problem.
If there are words like
I am not able to capture them. Can anyone suggest a possible way?
There has been a lot of research in this area over the past 20 years. Computers don't understand language, but we can train them to find the similarity or difference between two words with the help of some manual effort.
Approaches may be:
Based on manually curated datasets that describe how the words in a language are related to each other.
Based on statistical or probabilistic measures of words appearing in a corpus.
Method 1:
Try WordNet. It is a human-curated network of words that preserves the relationships between words according to human understanding. In short, it is a graph whose nodes are so-called 'synsets' and whose edges are the relations between them. Any two words that are close to each other in this graph are close in meaning, and words that fall within the same synset may mean exactly the same thing. 'Bag' and 'baggage' are close, which you can find by iteratively exploring the graph node to node in breadth-first style: start with 'bag' and explore its neighbors in an attempt to reach 'baggage'. You'll have to limit this search to a small number of iterations for any practical application. Another style is to start a random walk from one node and try to reach the other node within a given number of tries and a given distance. If you reach 'baggage' from 'bag', say, 500 times out of 1000 within 10 moves, you can be pretty sure that they are very similar to each other. Random walks are more helpful in much larger and more complex graphs.
There are many other similar resources online.
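Both search styles from Method 1 can be sketched on a toy graph. The graph below is hypothetical and is not real WordNet data; with WordNet the nodes would be synsets and the edges their relations. The function names and the depth and step limits are assumptions for illustration.

```python
import random
from collections import deque

# Hypothetical toy graph standing in for a lexical network.
graph = {
    "bag": ["baggage", "sack"],
    "baggage": ["bag", "luggage"],
    "luggage": ["baggage", "suitcase"],
    "suitcase": ["luggage"],
    "sack": ["bag"],
    "dinner": ["lunch"],
    "lunch": ["dinner"],
}

def bfs_distance(graph, start, target, max_depth=3):
    """Edges on the shortest path, or None if target is not within max_depth."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, depth = queue.popleft()
        if node == target:
            return depth
        if depth == max_depth:
            continue  # do not expand beyond the iteration limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return None

def random_walk_hit_rate(graph, start, target, walks=1000, max_steps=10, seed=0):
    """Fraction of random walks from start that reach target within max_steps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(walks):
        node = start
        for _ in range(max_steps):
            node = rng.choice(graph[node])
            if node == target:
                hits += 1
                break
    return hits / walks

print(bfs_distance(graph, "bag", "baggage"))
print(bfs_distance(graph, "bag", "dinner"))
print(random_walk_hit_rate(graph, "bag", "baggage"))
```

A small BFS distance or a high hit rate suggests the two words are closely related; words in disconnected parts of the graph are never reached.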
Method 2:
Word2Vec. It is hard to explain here, but it works by creating, for each word, a vector with a user-chosen number of dimensions based on the word's context in the text. There is a long-standing idea that words appearing in similar contexts have similar meanings, e.g. 'I'm gonna check out my bags' and 'I'm gonna check out my baggage' might both appear in a text. You can read the paper for an explanation (link at the end).
So you can train a Word2Vec model on a large corpus. In the end, you will be able to get a 'vector' for each word. You do not need to understand the significance of this vector. You can use this vector representation to find the similarity or difference between words, or to generate synonyms of any word. The idea is that words which are similar to each other have vectors close to each other.
Word2vec was introduced in 2013 and quickly became the 'thing-to-use' in most NLP applications. The quality of this approach depends on the amount and quality of your data. A Wikipedia dump is generally considered good training data, as it contains articles about almost everything. You can easily find ready-to-use models trained on Wikipedia online.
A tiny example from Radim's website:
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]
>>> model.doesnt_match("breakfast cereal dinner lunch".split())
>>> model.similarity('woman', 'man')
The first example gives you the word closest (topn=1) to the words woman and king while also being furthest from the word man. The answer is queen. The second example picks the odd one out. The third tells you how similar the two words are in your corpus.
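To give a feel for what such queries compute, here is a simplified sketch with hand-made vectors. The 3-dimensional vectors are hypothetical (a trained model learns its own, with far more dimensions), and the analogy function is a plain "add and subtract, then rank by cosine similarity" stand-in, not the library's actual implementation.

```python
import math

# Hypothetical toy vectors; the dimensions loosely mean (royal, male, female).
vectors = {
    "king":   [1.0, 1.0, 0.0],
    "queen":  [1.0, 0.0, 1.0],
    "man":    [0.0, 1.0, 0.0],
    "woman":  [0.0, 0.0, 1.0],
    "prince": [0.8, 1.0, 0.0],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(positive, negative):
    """Word closest to sum(positive) - sum(negative), excluding the query words."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy(positive=["woman", "king"], negative=["man"]))
print(cosine(vectors["king"], vectors["prince"]))
```

With these toy vectors, king + woman - man lands exactly on the vector for queen, which is why it wins the ranking.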
Easy to use tool for Word2vec : (Warning : Lots of Maths Ahead)

tf-idf - accessing a large sparse scipy matrix & getting the highest values

For the tf-idf result matrix, I want to get the top tf-idf values. I saw how one can set max_features for the tf-idf vectorizer, but that selects the words with the highest tf count. I still want to get the high tf-idf values, which could include words with low tf. One idea I looked up is doing something like tf_idf_matrix.sum(axis=0), which sums up the columns. This works in my code, but because there are 113k columns, print won't show them all. If I could use something like argsort() to access the top K column sums, that would be helpful.
This question stems from my original question, which is here.
The reason is that I want to know which words I should look at more closely, and those are not necessarily the ones with the highest frequency. I would also like to know about the "anomalies", that is, words that might not appear in all or many documents/posts but could have a high tf-idf in one or a few documents. I wanted to explain this in case there are other approaches I should consider.
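A small sketch of the argsort() idea, assuming tf_idf_matrix is a scipy CSR matrix such as the one a tf-idf vectorizer returns. The tiny matrix and the term names here are made up; with scikit-learn 1.0 or later the real term names would come from the vectorizer's get_feature_names_out(). The column maximum is included as one possible way to surface the "anomalies", which is an assumption about what is wanted.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3-document x 4-term tf-idf matrix.
tf_idf_matrix = csr_matrix(np.array([
    [0.0, 0.5, 0.0, 0.1],
    [0.0, 0.4, 0.9, 0.0],
    [0.2, 0.3, 0.0, 0.0],
]))
terms = np.array(["alpha", "beta", "gamma", "delta"])

k = 2

# Column sums: sum(axis=0) returns a 1 x n_terms matrix, so flatten it first.
col_sums = np.asarray(tf_idf_matrix.sum(axis=0)).ravel()
top_idx = np.argsort(col_sums)[::-1][:k]  # indices of the k largest sums
top_terms = terms[top_idx]

# Column maxima: highlights terms with a high score in only a few documents.
col_max = tf_idf_matrix.max(axis=0).toarray().ravel()
top_max_idx = np.argsort(col_max)[::-1][:k]
top_max_terms = terms[top_max_idx]

print(list(zip(top_terms, col_sums[top_idx])))
print(list(zip(top_max_terms, col_max[top_max_idx])))
```

Only the 1-D vector of column statistics is made dense, so this stays cheap even with 113k columns. For very large vocabularies, np.argpartition avoids a full sort.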

How to generate recommendation with matrix factorization

I've read some papers on matrix factorization (latent factor models) in recommender systems, and I can implement the algorithm. On the MovieLens dataset I get an RMSE similar to what the papers report.
However, I've found that if I try to generate a top-K (e.g. K=10) list of recommended movies for every user by ranking the predicted ratings, the movies predicted to be rated highly turn out to be the same for all users.
Is that just how it works, or have I got something wrong?
This is a known problem in recommendation.
It is sometimes called the "Harry Potter" effect: (almost) everybody likes Harry Potter.
So most automated procedures will find out which items are generally popular and recommend those to every user.
You can either filter out very popular items, or multiply the predicted rating by a factor that is lower the more globally popular an item is.
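The second option can be sketched as follows. The penalty formula, the alpha parameter, the function name rerank and all numbers are assumptions chosen for illustration; any factor that shrinks as popularity grows would fit the description.

```python
import math

def rerank(predicted, rating_counts, alpha=0.5, k=10):
    """Rank items by predicted rating times a popularity penalty.

    predicted:      item -> predicted rating for one user
    rating_counts:  item -> number of ratings overall (popularity proxy)
    alpha:          penalty strength (hypothetical knob; 0 disables the penalty)
    """
    adjusted = {}
    for item, score in predicted.items():
        # The factor shrinks as the item gets more ratings overall.
        factor = 1.0 / (1.0 + alpha * math.log10(1 + rating_counts[item]))
        adjusted[item] = score * factor
    return sorted(adjusted, key=adjusted.get, reverse=True)[:k]

# Hypothetical predictions for one user and global rating counts.
predicted = {"Harry Potter": 4.8, "Niche Film": 4.5, "Other Film": 3.0}
rating_counts = {"Harry Potter": 10000, "Niche Film": 50, "Other Film": 500}

plain = sorted(predicted, key=predicted.get, reverse=True)
penalized = rerank(predicted, rating_counts)
print(plain)
print(penalized)
```

With the penalty applied, the less popular but highly predicted item moves ahead of the globally popular one. The value of alpha would need tuning against whatever recommendation quality measure is used.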