How can I cluster buckets of strings? - cluster-analysis

I have several buckets. Each bucket contains many tags (strings). How can I cluster buckets together based on similarity or overlap?
E.g.
Bucket A: 'ostrich', 'sparrow', 'hummingbird', 'zebra', 'blue jay'
Bucket B: 'banana', 'watermelon', 'grape', 'carrot'
Bucket C: 'celery', 'lettuce', 'spinach', 'banana', 'carrot'
Bucket D: 'sparrow', 'dog', 'cat', 'lion', 'elephant', 'horse'
In this very small example, B and C would form one cluster (because of banana and carrot), while A and D would each end up in their own cluster because there isn't enough overlap to group them with anything else.

You can use set-based distances, such as the Jaccard distance, together with hierarchical clustering.
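For example, here is a minimal sketch in Python using SciPy's hierarchical clustering on the buckets above; the 0.8 distance threshold is just an illustrative choice, not a recommendation:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

buckets = {
    'A': {'ostrich', 'sparrow', 'hummingbird', 'zebra', 'blue jay'},
    'B': {'banana', 'watermelon', 'grape', 'carrot'},
    'C': {'celery', 'lettuce', 'spinach', 'banana', 'carrot'},
    'D': {'sparrow', 'dog', 'cat', 'lion', 'elephant', 'horse'},
}
names = list(buckets)

def jaccard_distance(a, b):
    # 1 - |intersection| / |union|
    return 1.0 - len(a & b) / len(a | b)

# condensed pairwise distance matrix (upper triangle, row by row), as linkage expects
dists = np.array([jaccard_distance(buckets[names[i]], buckets[names[j]])
                  for i in range(len(names)) for j in range(i + 1, len(names))])

Z = linkage(dists, method='average')               # average-linkage hierarchical clustering
labels = fcluster(Z, t=0.8, criterion='distance')  # cut the dendrogram at distance 0.8
print(dict(zip(names, labels)))                    # B and C share a cluster; A and D stay separate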

PySpark Proportionate Stratified Sampling "sampleBy"

Question: If you implement proportionate stratified sampling using PySpark's sampleBy, isn't it just the same thing as a random sample?
Edit: there are proportionate and disproportionate variants of stratified sampling. This question is about the former.
Here's my thinking on this:
Let's say you have 4 groups in a population of total size N = 1000. The groups have proportions:
A: 25%, B: 50%, C: 13%, and D: 12%
Then choosing a proportionate stratified sample of size 100 means choosing a sample consisting of exactly 25 elements from A, 50 elements from B, 13 elements from C and 12 elements from D. (Note: A disproportionate stratified sample would be if you had different sampling ratios than those of the population.)
This is in contrast to doing a random sample where the expected number of elements from A, B, C and D are 25, 50, 13, and 12 respectively.
It would be natural to implement proportionate stratified sampling in PySpark via the sampleBy method with fractions
fractions = {'A': .1, 'B': .1, 'C': .1, 'D': .1}
If this method sampled exactly, you'd get 25, 50, 13 and 12 elements respectively. However, this method is implemented with a Bernoulli trial (coin flipping) per element. Since all the fractions are identical, each element is chosen with probability 10%.
In this case, doing the Bernoulli trial stratum by stratum and then element by element is the same as doing it over the entire data set, and the latter is just simple random sampling.
Conclusion: Stratified sampling is "not possible" in this paradigm.
Is this a correct understanding?
I've seen some posts on doing exact sampling using special tricks. I'll see if I can answer my own post using the methods in related post (3) below.
Note: There is a sampleByKeyExact method, but it is not supported in Python, and even if it were, the performance and scaling penalties are not ideal.
https://spark.apache.org/docs/2.2.0/mllib-statistics.html
Related Posts:
1. Stratified sampling in Spark (mentions sampleByKeyExact, which isn't supported in Python)
2. Investopedia reference: https://www.investopedia.com/terms/stratified_random_sampling.asp
3. A creative work-around using additional columns that may work: pyspark - how to select exact number of records per strata using (df.sampleByKey()) in stratified random sampling
I think there is some confusion here related to standard definitions. Usually when someone says "stratified sampling", they mean that different classes should get different probabilities. In the example posted above
A: 25%, B: 50%, C: 13%, and D: 12%
A standard stratified sample will use fractions that ensure that, in expectation, the sample contains the same number of elements from A, B, C, and D. For example
fractions = {'A': .2, 'B': .1, 'C': 0.1*50/13, 'D': 0.1*50/12}
should give, in expectation, 50 elements of each class (0.2 × 250 = 0.1 × 500 = (0.1 × 50/13) × 130 = (0.1 × 50/12) × 120 = 50).
In the example given above where
fractions = {'A': .1, 'B': .1, 'C': 0.1, 'D': 0.1}
the behavior is indeed the same as a simple random sample with a proportion of 0.1.
The real question is: what are you aiming for? If you want your sample to have exactly the same proportion of classes as the original, then neither sample nor sampleByKey will provide that. Looking at the documentation, it seems that sampleByKeyExact will indeed do the trick.
Edit detailing the behavior of sample and sampleByKey:
For sample, a map operation basically goes over every element and, based on a random variable, decides whether to keep the item (and how many copies, in the case withReplacement == True). This random variable is i.i.d. across all elements. In sampleByKey, the random variables are still independent, but each has a different distribution based on the key value, or more accurately on the corresponding value in the fractions argument. If the values in fractions are identical, this random variable has the same distribution for all key values, which is why the behavior becomes identical for sample and sampleByKey.
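For concreteness, a small PySpark sketch of the two fraction choices discussed above (the toy DataFrame, the column name 'stratum', and the seed are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toy population: 250 A, 500 B, 130 C, 120 D
rows = [(i, s) for s, n in [('A', 250), ('B', 500), ('C', 130), ('D', 120)] for i in range(n)]
df = spark.createDataFrame(rows, ['id', 'stratum'])

# identical fractions: behaves like df.sample(fraction=0.1), i.e. simple random sampling
proportionate = df.sampleBy('stratum', fractions={'A': 0.1, 'B': 0.1, 'C': 0.1, 'D': 0.1}, seed=42)

# different fractions: roughly 50 elements expected from each class, as computed above
balanced = df.sampleBy('stratum',
                       fractions={'A': 0.2, 'B': 0.1, 'C': 0.1 * 50 / 13, 'D': 0.1 * 50 / 12},
                       seed=42)

proportionate.groupBy('stratum').count().show()  # expected counts ~25, ~50, ~13, ~12
balanced.groupBy('stratum').count().show()       # expected counts ~50 per class

Note that both are still Bernoulli samples, so the counts fluctuate around these expectations rather than hitting them exactly.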

How to get a list/vector of which clusters a node in a network has belonged to when the clusters change at each timestep?

I have used kmeans to cluster a population in MATLAB. I then simulate a disease in the population, and nodes that have the disease more than 80% of the time are excluded from the clustering, meaning the clusters change each iteration. It then reclusters over 99 timesteps. How do I create a vector/list of which clusters a specific node has belonged to over the whole time period?
I have tried using the vector returned by kmeans, called 'id', but this doesn't include the nodes that are excluded from the clustering, so I cannot track one specific node, as the size of id changes each time. This is what I tried; I ran it in the for loop so it plotted a line plot for each iteration:
nt = [nt sum(id(1,:))];
The only problem was that the first row of the vector obviously changed every timestep, so it wasn't the same person.
This is my initial simple clustering:
%Cluster the population according to these features
[id, c] = kmeans(feats', 5);
Then this is the reclustering process to exclude those who have the disease for more than 80% of the time (this part is in a big for loop in the whole code):
Lc = find(m < 0.8);
if t > 1
    [id, c, sumD, D] = kmeans(feats(:, Lc)', 5, 'Start', c);
else
    [id, c, sumD, D] = kmeans(feats(:, Lc)', 5);
end
I want to be able to plot and track the fate of specific nodes in my population which is why I want to know how their cluster groups change. Any help would be much appreciated!
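A minimal sketch of one possible bookkeeping approach, written in Python/NumPy purely to illustrate the idea (in MATLAB the equivalent would be to preallocate a timesteps-by-nodes matrix of NaNs and write the labels into the columns given by Lc each iteration); the data, disease fractions, and sizes below are all made up:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
T, N, K = 10, 200, 5                       # toy number of timesteps, nodes, clusters
feats = rng.normal(size=(N, 3))            # toy features, one row per node
history = np.full((T, N), np.nan)          # history[t, n] = cluster of node n at timestep t

for t in range(T):
    m = rng.random(N)                      # stand-in for the per-node disease fraction
    Lc = np.where(m < 0.8)[0]              # nodes included in the clustering this timestep
    labels = KMeans(n_clusters=K, n_init=10, random_state=t).fit_predict(feats[Lc])
    history[t, Lc] = labels                # excluded nodes keep NaN for this timestep

print(history[:, 7])                       # clusters node 7 belonged to (NaN where excluded)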

Is it possible to use a database for storing machine-learning data? If so, how?

I am new to machine learning. I would like to set up a model on my server and use a database to store what it has learned.
var colorClassifier = new NeuralNetwork();
colorClassifier.trainBatch([
    {input: { r: 0.03, g: 0.7, b: 0.5 }, output: 0}, // black
    {input: { r: 0.16, g: 0.09, b: 0.2 }, output: 1}, // white
    {input: { r: 0.5, g: 0.5, b: 1.0 }, output: 1}   // white
]);
console.log(colorClassifier.classify({ r: 1, g: 0.4, b: 0 })); // 0.99 - almost
Machine learning framework code looks something like this, but I want colorClassifier to be logically stored in my database rather than in memory, so that I can keep training this model over time without losing the data from old training runs. I don't really know how these frameworks work internally, but I think something like what I am asking for should be possible. Thank you.
"I don't really know how these frameworks internally work"
However the framework works, training a classifier means finding a set a weight values such that the classifier works well (usually this means minimizing the sum of squared errors). So, a trained classifier is essentially a set of real numbers. To persist the classifier you need to store these numbers to your database.
Each weight can be described by four numbers:
The layer number (integer): the first layer is the input layer, and the rest are the hidden layers (usually one or two), in the order they appear.
From and to (integers): since each weight connects two nodes, these are the indices of those nodes within their layers.
The value of the weight (usually a real number)
If, for example, you have a weight with value 5.8 going from the 3rd node of the 2nd layer to the 5th node of the next layer, you can store it in a table
layer: 2
from_node: 3
to_node: 5
value: 5.8
By repeating this for all weights (a simple for loop), you can store the trained network in a single table. I don't know how your framework works, but normally there will be a member function that returns the weights of the trained network in list or hashmap format.
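A minimal sketch of this idea in Python with SQLite (the table layout follows the four numbers above; the weight list itself is a made-up placeholder for whatever your framework's export function returns):

import sqlite3

# toy "trained network": one (layer, from_node, to_node, value) tuple per weight,
# e.g. built by looping over whatever weight structure your framework exposes
weights = [
    (2, 3, 5, 5.8),
    (1, 0, 2, -0.7),
]

conn = sqlite3.connect('model.db')
conn.execute('CREATE TABLE IF NOT EXISTS weights '
             '(layer INTEGER, from_node INTEGER, to_node INTEGER, value REAL)')
conn.execute('DELETE FROM weights')    # replace the previous snapshot of the network
conn.executemany('INSERT INTO weights VALUES (?, ?, ?, ?)', weights)
conn.commit()

# later: reload the stored weights and feed them back into the network
restored = conn.execute('SELECT layer, from_node, to_node, value FROM weights').fetchall()
conn.close()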

Generate data from kmean's clusters

So I have an input vector A, which is a row vector with 3,000 data points. Using MATLAB, I found 3 cluster centres for A.
Now that I have the 3 cluster centres, I have another row vector B with 3,000 points. The elements of B have one of three values: 1, 2 or 3. So, for example, say the first 5 elements of B are
B(1,1:5) = [ 1 , 3, 3, 2, 1]
This means that B(1,1) belongs to cluster 1, B(1,2) belongs to cluster 3, etc. What I am trying to do is: for every data point in the row vector B, look at which cluster it belongs to by reading its value, and then replace that value with a data value drawn from that cluster.
So after the above is done, the first 5 elements of B would look like:
B(1,1:5) = [ 2.7 , 78.4, 55.3, 19, 0.3]
Meaning that B(1,1) is a data value picked from the first cluster (that we got from A), B(1,2) is a data value picked from the third cluster (that we got from A) etc.
k-means only keeps the cluster means; it does not model the data distribution.
You cannot sensibly generate artificial data from k-means clusters without additional statistics and distribution assumptions.
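For illustration only, here is a sketch (in Python/NumPy rather than MATLAB) of what such additional assumptions could look like: if you are willing to assume each cluster is roughly Gaussian, you can estimate a per-cluster mean and standard deviation from A and draw the replacement values for B from those fitted Gaussians. Note that k-means label numbering is arbitrary, so the clusters will not automatically match the 1/2/3 coding in B without remapping; the data below is made up.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
A = rng.normal(loc=[0, 10, 50], scale=[1, 3, 5], size=(1000, 3)).ravel()  # toy 1-D data, 3000 points

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(A.reshape(-1, 1))
labels = km.labels_

# the "additional statistics": per-cluster mean and std, plus a Gaussian assumption
mu = np.array([A[labels == k].mean() for k in range(3)])
sigma = np.array([A[labels == k].std() for k in range(3)])

B = rng.integers(1, 4, size=3000)              # cluster indices 1..3, as in the question
samples = rng.normal(mu[B - 1], sigma[B - 1])  # one draw per element of B, from its cluster
print(samples[:5])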

Finding similar products using LSH on structured data

I am trying to build a similar-products feature using LSH, and I have the following query.
My data has the following schema:
id: long,
title: string,
description: string,
category: string,
price: double,
inventory_count: int,
active: boolean,
date_added: datetime
Should I perform LSH on individual features separately and then combine them in some way, maybe a weighted average?
or
Should I build LSH on all features together (basically attaching the feature name while creating shingles, like title_iphone, title_nexus, price_1200.25, active_1, ...) and then perform LSH on this bag using a bag-of-words approach?
If someone can direct me to a document explaining how to perform LSH on structured data like e-commerce data, that would be great.
P.S. I'm planning to use Spark and a MinHash function for the LSH. Let me know if you need any more details.
I would go with your first approach, but concatenate the binary codes obtained from each individual LSH hash instead of averaging them.
For instance, suppose you use 4 bits to represent the hash for each feature family:
data_0:
hash(id) 0101
hash(title) 1001
hash(date_added) 0001
hash(data_0) = 0101,1001,0001
weighted_average = (5+9+1)/3 = 15/3 = 5
Now suppose you have another hash for data_1:
hash(data_1) = 111100000000
weighted_average = (15+0+0)/3= 15/3 = 5
In your retrieval process, the similarity search could be performed by first computing the hash for the query data; for instance,
hash(data_x) = 010010000011
weighted_average = (4+8+3)/3 = 15/3 = 5
Suppose you find that data_1 and data_0 are the only two data pieces that have been hashed to the same bucket as data_x. Then you only need to compute the Hamming distance (which can be calculated using the bitwise XOR operator) between
data_1 and data_x -> hamming distance = 6, similarity = 6/12
data_0 and data_x -> hamming distance = 3, similarity = 9/12
So in this example, data_0 is the most similar data to your query.
NOTE: you will lose the similarity info encoded in the individual binary codes if you average them. In the example above, you would get the same encoding for data_1 and data_0, namely 5 (0101 in binary). However, if you look at each individual feature, data_1 is clearly more different from data_x than data_0 is.
ALSO NOTE: if you feel some feature family is more important and thus deserves more weight, you can use more bits for that feature family.
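A tiny Python sketch of the concatenate-and-compare step, using the 12-bit codes from the example above (the hash values are just the illustrative numbers from this answer, not real MinHash output):

data_0 = 0b0101_1001_0001
data_1 = 0b1111_0000_0000
data_x = 0b0100_1000_0011

def hamming(a, b):
    # Hamming distance via XOR: count the bit positions where a and b differ
    return bin(a ^ b).count('1')

def similarity(a, b, n_bits=12):
    # fraction of matching bits
    return (n_bits - hamming(a, b)) / n_bits

print(hamming(data_1, data_x), similarity(data_1, data_x))  # 6, 0.5  (= 6/12)
print(hamming(data_0, data_x), similarity(data_0, data_x))  # 3, 0.75 (= 9/12)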