Compute similarity in pyspark

I have a CSV file containing some data, and I want to select the rows that are most similar to a given input.
My data looks like this:
H1 | H2 | H3
---+----+----
A  | 1  | 7
B  | 5  | 3
C  | 7  | 2
The data point I want to find similar rows to in my CSV is [6, 8].
In other words, I want to find the rows whose H2 and H3 values are most similar to this input, and return their H1 values.
I would like to use PySpark with a similarity measure such as Euclidean distance, Manhattan distance, cosine similarity, or a machine learning algorithm.
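One possible approach (a minimal PySpark sketch, not from the question, assuming the data is already in a DataFrame with columns H1, H2, H3 and using Euclidean distance) is to compute the distance from every row to the input point and order by it:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Example data matching the table above; in practice this would come from spark.read.csv
df = spark.createDataFrame([("A", 1, 7), ("B", 5, 3), ("C", 7, 2)], ["H1", "H2", "H3"])

query = [6, 8]  # the input point (H2, H3)

# Euclidean distance from each row to the query point
dist = F.sqrt(F.pow(F.col("H2") - query[0], 2) + F.pow(F.col("H3") - query[1], 2))

# Order rows from most to least similar; the first H1 is the closest match
df.withColumn("dist", dist).orderBy("dist").select("H1", "dist").show()
Manhattan distance would simply replace the expression with F.abs(F.col("H2") - query[0]) + F.abs(F.col("H3") - query[1]).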

Related

How to define neighborhoods in hexagonal Self-Organizing Maps (SOM)

I'm trying to implement a two-dimensional SOM with shrinking neighborhoods, but to avoid evaluating the neighborhood function for every neuron on each input, I want to define the neighbors of each neuron when the lattice is constructed. That is, when creating the SOM, I would add each neighbor to a list stored with the neuron, so that when a neuron is selected as BMU, I only have to apply the neighborhood function to the neurons in that BMU's list.
The problem is defining the topology of a hexagonal lattice within a two-dimensional array, which is the structure I'm using for the SOM, because to achieve the hexagonal distribution I would have to do something like this:
n1 | null | n2 | null | n3
null | n4 | null | n5 | null
n6 | null | n7 | null | n8
Is it correct to create the array like that, or is there a way to create a normal array and just adjust the indexes?
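One common alternative (a Python sketch, not from the question) is to keep a normal, dense 2D array and use "offset" hexagonal coordinates: the six neighbor offsets of a cell depend only on whether its row is even or odd, so each neuron's neighbor list can be precomputed once when the lattice is built:
# Neighbor offsets for an "odd-r" offset hexagonal layout:
# the six neighbors of (row, col) depend on the parity of the row.
EVEN_ROW = [(-1, -1), (-1, 0), (0, -1), (0, 1), (1, -1), (1, 0)]
ODD_ROW  = [(-1, 0), (-1, 1), (0, -1), (0, 1), (1, 0), (1, 1)]

def hex_neighbors(row, col, n_rows, n_cols):
    # Return the in-grid hexagonal neighbors of cell (row, col)
    offsets = ODD_ROW if row % 2 else EVEN_ROW
    return [(row + dr, col + dc)
            for dr, dc in offsets
            if 0 <= row + dr < n_rows and 0 <= col + dc < n_cols]

# Precompute the neighbor list of every neuron at construction time
n_rows, n_cols = 3, 5
neighbor_lists = {(r, c): hex_neighbors(r, c, n_rows, n_cols)
                  for r in range(n_rows) for c in range(n_cols)}
With this, the shrinking neighborhood function only needs to be applied to the BMU's precomputed list, and no null placeholders are needed in the array.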

How do I implement dynamically sized data structures in MATLAB?

I am trying to implement a dynamically sized data structure in MATLAB.
I have a 2D plane with nodes. For each node I need to save the coordinates and the coordinates of the nodes around it, within a distance of e.g. 100.
Imagine a circle with a radius of 100 around each node. I want to store all nodes within this circle for every node.
For example:
(diagram: a rectangular plane with several nodes marked 'x' scattered across it)
I tried to implement this as shown below. I create a NodeList which contains a NodeStruct for every node. Every NodeStruct contains the coordinates of its corresponding node, as well as an array of the nodes around it. The problem with the implementation I had in mind is that the variable NodeStruct.NextNode changes its size for every node.
I have an idea of how to find all the nodes; my problem is the data structure for storing all the necessary information.
NodeList = [];
NodeStruct.Coords = [];
NodeStruct.NextNode = [];
You can create a struct array that you index as follows:
NodeStruct(3).Coords = [x,y];
NodeStruct(3).NextNode = [1,2,6,10];
However, it is likely that this is better solved with an adjacency matrix. That is an NxN matrix, with N the number of nodes, and where adj(i,j) is true if nodes i and j are within the given distance of each other. In this case, the adjacency matrix is symmetric, but it doesn't need to be if you list, for example, the 10 nearest nodes for each node. That case can also be handled with the adjacency matrix.
Given an Nx2 matrix with coordinates coord, where each row is the coordinates for one node, you can write:
dist = sqrt(sum((reshape(coord,[],1,2) - reshape(coord,1,[],2)).^2, 3));
adj = dist < 100; % or whatever your threshold is

Cosine-similarity between columns in a Spark dataframe

I have data that looks like this...
+------------+-------------------+
| searchterm | title             |
+------------+-------------------+
| red ball   | A big red ball    |
| red ball   | A small blue ball |
| ...        | ...               |
+------------+-------------------+
I'm trying to find the cosine similarity between the searchterm column and the title column in Scala. I'm able to tokenize each column without issue, but most similarity implementations I have found online operate across rows rather than across columns, i.e. they would compare 'a big red ball' with 'a small blue ball' rather than making the cross-column comparison I actually want. Any ideas? I'm very new to Scala, but this is how I would do it in Python.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def get_text_cosine_similarity(row):
    # Form TF-IDF matrix from the two text cells of this row
    text_arr = row[['searchterm', 'title']].values
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(text_arr)
    # Get cosine similarity 'score', assuming keyword is at index 0
    similarity_scores = cosine_similarity(tfidf_matrix[0], tfidf_matrix)
    return pd.Series(similarity_scores[0][1:])

df[['title_cs']] = df.apply(get_text_cosine_similarity, axis=1)
Using sklearn.metrics.pairwise.cosine_similarity and sklearn.feature_extraction.text.TfidfVectorizer
You could transpose the matrix and then do the cosine similarity
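As a sketch of the per-row, cross-column comparison (in PySpark rather than Scala, and using raw term frequencies instead of TF-IDF for brevity; none of the names below come from the original post): vectorize both columns with the same HashingTF so they share one feature space, then compare the two vectors of each row with a small cosine-similarity UDF.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType
from pyspark.ml.feature import Tokenizer, HashingTF

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("red ball", "A big red ball"), ("red ball", "A small blue ball")],
    ["searchterm", "title"])

# Tokenize and hash both text columns into the same feature space
for c in ["searchterm", "title"]:
    df = Tokenizer(inputCol=c, outputCol=c + "_tokens").transform(df)
    df = HashingTF(inputCol=c + "_tokens", outputCol=c + "_tf",
                   numFeatures=1 << 18).transform(df)

@F.udf(DoubleType())
def cos_sim(u, v):
    # Cosine similarity between the two sparse vectors of one row
    denom = float(u.norm(2)) * float(v.norm(2))
    return float(u.dot(v)) / denom if denom else 0.0

df = df.withColumn("title_cs", cos_sim("searchterm_tf", "title_tf"))
df.select("searchterm", "title", "title_cs").show(truncate=False)
The same structure translates to Scala almost line for line, with the UDF defined via org.apache.spark.sql.functions.udf.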

How does the Graphite summarize function with avg work?

I'm trying to figure out how the Graphite summarize function works. I have the following data points, where the X-axis represents time and the Y-axis the duration in ms.
+-------+------+
| X | Y |
+-------+------+
| 10:20 | 0 |
| 10:30 | 1585 |
| 10:40 | 356 |
| 10:50 | 0 |
+-------+------+
When I pick any time window of 2 hours or more on Grafana (why?) and apply summarize('1h', avg, false), I get a triangle starting at (9:00, 0) and ending at (11:00, 0), with the peak at (10:00, 324).
A formula that a colleague came up with to explain the above observation is as follows.
Let:
a = Number of data points for a peak, in this case 4.
b = Number of non-zero data points, in this case 2.
Then avg = sum / (a + b). This produces (1585 + 356) / 6 ≈ 324, but that doesn't match the definition of any mean I know of. What is the math behind this?
Your data is at 10-minute intervals, so there are 6 points in each 1-hour period. Graphite will simply take the sum of the non-null values in each period divided by the count (a standard average). If you look at the raw series, you'll likely find that there are also zero values at 10:00 and 10:10.
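A quick check of that arithmetic, assuming the two extra zero samples at 10:00 and 10:10 mentioned above:
# Six 10-minute samples in the 10:00-11:00 bucket (10:00 and 10:10 assumed to be 0)
samples = [0, 0, 0, 1585, 356, 0]
print(sum(samples) / len(samples))  # 323.5, shown as the ~324 peak at 10:00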

Simulation of custom probability distribution in Matlab

I'm trying to simulate the following distribution:
a    | 0    | 1    | 7    | 11   | 13
-----+------+------+------+------+------
p(a) | 0.34 | 0.02 | 0.24 | 0.29 | 0.11
I already simulated a similar problem: four types of balls with probabilities 0.3, 0.1, 0.4 and 0.2. I created a vector F = [0 0.3 0.4 0.8 1] and used repmat to replicate it into 1000 rows. Then I compared it with a column vector of 1000 random numbers, replicated into 5 columns with the same repmat approach. I compared the two, computed the sum vector of the resulting matrix, and took the differences to get the frequencies (e.g. [301 117 386 196]).
But with the current distribution I don't know how to create the initial matrix F, or whether I can use the same approach at all.
I need the answer to be "vectorised", so no for, while or if loops.
What if you create the following arrays:
largeNumber = 1000000;
a=repmat( [0], 1, largeNumber*0.34 );
b=repmat( [1], 1, largeNumber*0.02 );
% ...
e=repmat( [13], 1, largeNumber*0.11 );
Then you concatenate all of these arrays (to get a single array where your entries are represented with their corresponding probabilities), shuffle them, and extract the first N elements to get an N-dimensional vector drawn from your distribution.
EDIT: of course this answer is the way to go.
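For illustration only, a NumPy sketch of the concatenate-shuffle-take idea described above (the question itself is about MATLAB, and none of this comes from the original answer):
import numpy as np

values = np.array([0, 1, 7, 11, 13])
probs = np.array([0.34, 0.02, 0.24, 0.29, 0.11])
large_number = 1_000_000

# Pool in which each value appears in proportion to its probability
pool = np.repeat(values, (probs * large_number).astype(int))
np.random.shuffle(pool)

sample = pool[:1000]                          # N draws from the distribution
print(np.unique(sample, return_counts=True))  # empirical counts roughly match probs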