Cosine similarity between columns in a Spark dataframe - scala

I have data that looks like this...
+----------+-----------------+
|searchterm|            title|
+----------+-----------------+
|  red ball|   A big red ball|
|  red ball|A small blue ball|
|       ...|              ...|
+----------+-----------------+
I'm trying to find the cosine similarity between the searchterm column and the title column in Scala. I'm able to tokenize each column without issue, but most similarity implementations I have found online operate across rows and not across columns, i.e. they would compare 'a big red ball' with 'a small blue ball' rather than the cross-column comparison I actually want. Any ideas? I'm very new to Scala, but this is how I would do it in Python.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def get_text_cosine_similarity(row):
    # Form a TF-IDF matrix over the two strings in this row
    text_arr = row[['searchterm', 'title']].values
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(text_arr)
    # Get the cosine similarity 'score', assuming the keyword is at index 0
    similarity_scores = cosine_similarity(tfidf_matrix[0], tfidf_matrix)
    return pd.Series(similarity_scores[0][1:])

df[['title_cs']] = df.apply(get_text_cosine_similarity, axis=1)
This uses sklearn.metrics.pairwise.cosine_similarity and sklearn.feature_extraction.text.TfidfVectorizer.

You could transpose the matrix and then do the cosine similarity
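For the cross-column comparison in Spark itself, here is a minimal sketch (in PySpark, since the question already includes Python; the equivalent Tokenizer and HashingTF classes also exist in the Scala spark.ml API, and the column and variable names below are assumptions). Both columns are hashed into the same vector space, and a UDF then computes the cosine similarity of the two vectors in each row:

from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def vectorize(df, text_col, vec_col):
    # Tokenize the column, then hash the tokens into a fixed-size
    # term-frequency vector (both columns share the same hash space)
    tokens = Tokenizer(inputCol=text_col, outputCol=text_col + "_tokens").transform(df)
    return HashingTF(inputCol=text_col + "_tokens", outputCol=vec_col).transform(tokens)

vectorized = vectorize(vectorize(df, "searchterm", "search_vec"), "title", "title_vec")

@udf(DoubleType())
def cos_sim(u, v):
    # Cosine similarity between the two vectors of a single row
    denom = float(u.norm(2) * v.norm(2))
    return float(u.dot(v)) / denom if denom else 0.0

result = vectorized.withColumn("title_cs", cos_sim("search_vec", "title_vec"))

Note that this uses raw term frequencies rather than the per-row TF-IDF fit of the pandas version.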

Related

Compute similarity in pyspark

I have a CSV file that contains some data, and I want to select the rows most similar to a given input.
My data looks like this:
H1 | H2 | H3
---+----+---
A  | 1  | 7
B  | 5  | 3
C  | 7  | 2
The data point I want to find similar rows to is [6, 8]. That is, I want to find the rows whose H2 and H3 values are similar to the input, and return their H1 values.
I want to use pyspark with a similarity measure such as Euclidean distance, Manhattan distance, or cosine similarity, or with a machine learning algorithm.
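For the Euclidean-distance case, a minimal sketch (assuming the CSV has a header row and numeric H2/H3 columns; the file name is a placeholder) is to compute the distance as a column expression and sort by it:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("data.csv", header=True, inferSchema=True)

point = [6, 8]  # the input [H2, H3] point
# Euclidean distance from each row's (H2, H3) to the input point
dist = F.sqrt((F.col("H2") - point[0]) ** 2 + (F.col("H3") - point[1]) ** 2)
df.withColumn("dist", dist).orderBy("dist").select("H1", "dist").show(3)

Manhattan distance would just replace the expression with the sum of absolute differences.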

How to define neighborhoods in hexagonal Self-Organizing Maps (SOM)

I'm trying to implement a two-dimensional SOM with shrinking neighborhoods. To avoid evaluating the neighborhood function for every neuron on every input, I want to define each neuron's neighbors when the lattice is constructed. That is, when creating the SOM I would add each neighbor to a list stored in the neuron, so that when a neuron is selected as BMU, I only have to apply the neighborhood function to the neurons in that BMU's list.
The problem is defining the topology of a hexagonal lattice within a two-dimensional array, which is the structure I'm using for the SOM, because to achieve the hexagonal distribution I would have to do something like this:
n1   | null | n2   | null | n3
null | n4   | null | n5   | null
n6   | null | n7   | null | n8
Is it correct to create the array like that, or is there a way to create a normal array and adjust the indexes?
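A common alternative to the null-padded array is to keep a normal rows x cols array and adjust the neighbor offsets by row parity (the so-called "odd-r" offset layout for hexagonal grids). A minimal sketch in Python, with the function and variable names being illustrative only:

def hex_neighbors(row, col, rows, cols):
    # In an "odd-r" layout, odd rows are shifted half a cell to the right,
    # so the six neighbor offsets depend on the parity of the row
    if row % 2 == 0:
        deltas = [(-1, -1), (-1, 0), (0, -1), (0, 1), (1, -1), (1, 0)]
    else:
        deltas = [(-1, 0), (-1, 1), (0, -1), (0, 1), (1, 0), (1, 1)]
    # Keep only the neighbors that fall inside the lattice
    return [(row + dr, col + dc) for dr, dc in deltas
            if 0 <= row + dr < rows and 0 <= col + dc < cols]

Each neuron's neighbor list could then be filled once at construction time by calling this for its (row, col) position, exactly as described above.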

How do I implement dynamically sized data structures in MATLAB?

I am trying to implement a dynamically sized data structure in MATLAB.
I have a 2D plane with nodes. For each node I need to save the coordinates and the coordinates of the nodes around it, within a distance of e.g. 100.
Imagine a circle with a radius of 100 around each node. I want to store all nodes within this circle for every node.
For example:
-----------------------------------------------
|                                             |
|        x                                    |
|             x                               |
|                                             |
|                               x             |
|         x                                   |
|                  x                          |
|                                             |
|                                    x        |
-----------------------------------------------
I tried to implement this as shown below. I create a NodeList which contains a NodeStruct for every node. Every NodeStruct contains the coordinates of its corresponding node, as well as an array of the nodes around it. The problem with the implementation I had in mind is that NodeStruct.NextNode changes its size for every node.
I have an idea on how to find all the nodes; my problem is the data structure to store all the necessary information.
NodeList = [];
NodeStruct.Coords = [];
NodeStruct.NextNode = [];
You can create a struct array that you index as follows:
NodeStruct(3).Coords = [x,y];
NodeStruct(3).NextNode = [1,2,6,10];
However, it is likely that this is better solved with an adjacency matrix. That is an NxN matrix, with N the number of nodes, and where adj(i,j) is true if nodes i and j are within the given distance of each other. In this case, the adjacency matrix is symmetric, but it doesn't need to be if you list, for example, the 10 nearest nodes for each node. That case can also be handled with the adjacency matrix.
Given an Nx2 matrix with coordinates coord, where each row is the coordinates for one node, you can write:
% Pairwise Euclidean distances via implicit expansion (R2016b and newer):
% reshape coord to Nx1x2 and 1xNx2, subtract, and sum the squares over dim 3
dist = sqrt(sum((reshape(coord,[],1,2) - reshape(coord,1,[],2)).^2, 3));
adj = dist < 100; % or whatever your threshold is
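Each node's neighbors can then be read off a row of the matrix with find(adj(i,:)); note that adj(i,i) is always true, since dist(i,i) is 0, so you may want to exclude the node itself.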

Geometric mean of columns in dataframe

I use this code to compute the geometric mean of all rows within a dataframe:
from pyspark.sql.functions import rand, randn, sqrt
df = sqlContext.range(0, 10)
df = df.select(rand(seed=10).alias("c1"), randn(seed=27).alias("c2"))
df.show()
newdf = df.withColumn('total', sqrt(sum(df[col] for col in df.columns)))
newdf.show()
This displays the dataframe with the new total column. To compute the geometric mean of the columns instead of the rows, I thought this code would suffice:
newdf = df.withColumn('total', sqrt(sum(df[row] for row in df.rows)))
But this throws an error: NameError: global name 'row' is not defined. So it appears that the API for accessing columns is not the same as the one for accessing rows.
Should I reformat the data to convert the rows to columns and then reuse the working algorithm, newdf = df.withColumn('total', sqrt(sum(df[col] for col in df.columns))), or is there a solution that processes the rows and columns as they are?
I am not sure your definition of geometric mean is correct. According to Wikipedia, the geometric mean is defined as the nth root of the product of n numbers. According to the same page, the geometric mean can also be expressed as the exponential of the arithmetic mean of logarithms. I shall be using this to calculate the geometric mean of each column.
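In symbols, that identity is

\left(\prod_{i=1}^{n} x_i\right)^{1/n} = \exp\left(\frac{1}{n}\sum_{i=1}^{n}\ln x_i\right)

which holds for positive x_i.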
You can calculate the geometric mean by reshaping the data: the values of c1 and c2 are combined into a single column called value, with the source column name stored in a column called column. After the data has been reformatted, the geometric mean is determined by grouping on column (c1 or c2) and taking the exponential of the arithmetic mean of the logarithm of value for each group. In this calculation, NaN values are ignored.
from pyspark.sql import functions as F

df = sqlContext.range(0, 10)
df = df.select(F.rand(seed=10).alias("c1"), F.randn(seed=27).alias("c2"))

# Tag each row with an id, then explode every row into (column, value) pairs
df_id = df.withColumn("id", F.monotonically_increasing_id())
kvp = F.explode(F.array([F.struct(F.lit(c).alias("column"), F.col(c).alias("value")) for c in df.columns])).alias("kvp")
df_pivoted = df_id.select(['id'] + [kvp]).select(['id'] + ["kvp.column", "kvp.value"])

# Geometric mean per column: the exponential of the mean of the logs
# (the log of a non-positive value is null, which avg ignores)
df_geometric_mean = df_pivoted.groupBy(['column']).agg(F.exp(F.avg(F.log(df_pivoted.value))))
df_geometric_mean.withColumnRenamed("EXP(avg(LOG(value)))", "geometric_mean").show()
This returns:
+------+-------------------+
|column| geometric_mean|
+------+-------------------+
| c1|0.25618961513533134|
| c2| 0.415119290980354|
+------+-------------------+
These geometric means, apart from their displayed precision, match the geometric means returned by scipy, provided NaN values are ignored there as well (the snippet below keeps only the positive values, whose logarithm is defined).
from scipy.stats.mstats import gmean
# Keep only the positive values, for which the logarithm is defined
c1 = [x['c1'] for x in df.collect() if x['c1'] > 0]
c2 = [x['c2'] for x in df.collect() if x['c2'] > 0]
print 'c1 : {0}\r\nc2 : {1}'.format(gmean(c1), gmean(c2))
This snippet returns:
c1 : 0.256189615135
c2 : 0.41511929098

How to fill up line with delimiters in Matlab

I have a cell array A, which I wish to print as a table. The first column and the first row are headers. For example, I have:
A:
1 2 3
4 5 6
7 8 9
And I want the output to look like this:
A:
1 ||2 |3
-------------
4 ||5 |6
7 ||8 |9
The vertical bars are not problematic. I just don't know how to print the horizontal line. It needs to be more flexible than just disp('-------'): it should resize depending on how big the strings in my cells are.
So far I have only implemented the ugly way, which just displays a static string '-----'.
function [] = dispTable(table)
basicStr = '%s\t| ';
fmt = strcat('%s\t||',repmat(basicStr,1,size(table,2)-1),'\n');
lineHeader = '------';
%print first line as header:
fprintf(1,fmt,table{1,:});
disp(lineHeader);
fprintf(1,fmt,table{2:end,:});
end
Any help is appreciated. Thanks!
You are not going to be able to reliably compute the width of a field, since you are using tabs, whose width can vary from machine to machine. Also, if you're trying to display something in a tabular structure, it's best to avoid tabs in case two values differ by more than 8 characters, which would stop the columns from lining up.
Rather than tabs, I would use fixed-width fields for your data; then you know exactly how many - characters to use.
% Construct a format string using fixed-width fields
% NOTE: You could compute the needed width dynamically based on the input
fmt = ['%-4s||', repmat('%-4s|', 1, size(table, 2) - 1)];
% Replace the last | with a newline (char(10))
fmt(end) = char(10);
% You can compute how many hyphens you need to span your data
h_line = [repmat('-', [1, 5 * size(table, 2)]), 10];
% Now print the result; the data rows are transposed first, because
% table{2:end,:} expands the cells in column-major order and would
% otherwise scramble the rows
body = table(2:end, :).';
fprintf(fmt, table{1,:})
fprintf(h_line)
fprintf(fmt, body{:})
% 1   ||2   |3
% ---------------
% 4   ||5   |6
% 7   ||8   |9