How to calculate the cosine similarity of two vectors in PySpark?

I am trying to compute the cosine similarity of two vectors in PySpark, e.g.
1 - spatial.distance.cosine(xvec, yvec)
but scipy does not seem to support the pyspark.ml.linalg.Vector type.

You can use the dot and norm methods to calculate this pretty easily:
from pyspark.ml.linalg import Vectors

x = Vectors.dense([1, 2, 3])
y = Vectors.dense([2, 3, 5])

# Cosine distance (1 minus the cosine similarity), matching scipy's cosine():
1 - x.dot(y)/(x.norm(2)*y.norm(2))
# 0.0028235350472619603
With scipy, for comparison:
import numpy as np
from scipy.spatial.distance import cosine

x = np.array([1, 2, 3])
y = np.array([2, 3, 5])
cosine(x, y)
# 0.0028235350472619603
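If you need this per row of a DataFrame rather than for two standalone vectors, one option is to wrap the same dot/norm expression in a UDF. This is only a sketch: the column names xvec and yvec and the one-row example DataFrame are made up here.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

@udf(returnType=DoubleType())
def cos_sim(v1, v2):
    # ml vectors arrive in the UDF as DenseVector/SparseVector, so dot and norm work directly
    return float(v1.dot(v2) / (v1.norm(2) * v2.norm(2)))

df = spark.createDataFrame(
    [(Vectors.dense([1, 2, 3]), Vectors.dense([2, 3, 5]))],
    ["xvec", "yvec"],
)
df.select(cos_sim("xvec", "yvec").alias("cos_sim")).show()
# cos_sim is the similarity itself (about 0.99718 here); subtract it from 1 for the distance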

Related

Remove noise and smooth the ECG signal

I am processing the Long Term AF database - https://physionet.org/content/ltafdb/1.0.0/
When I test the 30 s strips of this data, my model is not correctly predicting the signals, so I am trying to deal with the noise in this dataset. Here is how it looks (plot of the raw signal omitted).
Here is the code to plot:
def plot_filter_graphs(data, xmin, xmax, order):
    import numpy as np
    from scipy.signal import lfilter, filtfilt, butter
    from matplotlib.pyplot import plot, legend, show, grid, figure, xlim

    # 1-35 Hz Butterworth band-pass for a 300 Hz sampling rate
    lowcut = 1
    highcut = 35
    nyq = 0.5 * 300
    low = lowcut / nyq
    high = highcut / nyq
    b, a = butter(order, [low, high], btype='band')

    # Apply the filter once with lfilter (this introduces a phase delay).
    z = lfilter(b, a, data)

    # Zero-phase filtering with filtfilt, followed by one more reversed lfilter pass.
    y = filtfilt(b, a, data)
    y = np.flipud(y)
    y = lfilter(b, a, y)
    y = np.flipud(y)

    # Make the plot.
    figure(figsize=(16, 5))
    plot(data, 'b', linewidth=1.75)
    plot(z, 'r--', linewidth=1.75)
    plot(y, 'k', linewidth=1.75)
    xlim(xmin, xmax)
    legend(('actual', 'lfilter', 'filtfilt'), loc='best')
    grid(True)
    show()
I am using a Butterworth band-pass filter to filter out the noise. I also checked with filtfilt and lfilter, but neither is giving a good result.
Any suggestion on how the noise can be removed so that the signal is clean enough to be used for model prediction?
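One direction worth trying (a minimal sketch, not a verified fix for this dataset): build the same 1-35 Hz Butterworth band-pass as second-order sections and apply it with zero-phase filtering, which is usually better behaved numerically than the (b, a) form. The 300 Hz sampling rate is taken from the nyq = 0.5 * 300 line above, the helper name bandpass_ecg is made up, and the fs argument of butter needs a reasonably recent SciPy.
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 300.0  # sampling rate, from nyq = 0.5 * 300 in the question

def bandpass_ecg(ecg, lowcut=1.0, highcut=35.0, order=4):
    # Design the band-pass as second-order sections (more stable than b, a at higher orders)
    sos = butter(order, [lowcut, highcut], btype='band', fs=fs, output='sos')
    # Zero-phase filtering removes baseline wander (< 1 Hz) and high-frequency noise (> 35 Hz)
    # without shifting the QRS complexes in time
    return sosfiltfilt(sos, np.asarray(ecg, dtype=float))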

Using numerical methods to plot solution to first-order nonlinear differential equation in Matlab

I have a question about plotting x(t), the solution of the first-order differential equation whose right-hand side dx/dt is given below. The value of x is 0 at t = 0.
syms x
dxdt = -(1.0*(6.84e+45*x^2 + 5.24e+32*x - 2.49e+42))/(2.47e+39*x + 7.12e+37)
I want to plot the solution of this first-order nonlinear differential equation. The analytical solution involves complex numbers, which is not relevant here because the equation models a real-life process, but MATLAB can solve the equation with numerical methods and plot it. Can someone please suggest how to do this?
In MATLAB, try this:
tspan = [0 10];
x0 = 0;
[t,x] = ode45(@(t,x) -(1.0*(6.84e+45*x^2 + 5.24e+32*x - 2.49e+42))/(2.47e+39*x + 7.12e+37), tspan, x0);
plot(t,x,'b')
I tried it and got a plot of x(t) (image omitted). Hope that helps.
I have written an example of how to do this in Python with SymPy and matplotlib. SymPy can calculate both definite and indefinite integrals. Calculate the indefinite integral and add a constant chosen so that it evaluates to 0 at the initial point (x = 0, matching x(0) = 0). Now you have the integral, so it is just a matter of plotting: define an array from a starting point to an endpoint with enough points in between, evaluate the integral plus the constant at each point, and plot the result with matplotlib. There are plenty of other questions on how to customize plots with matplotlib.
The code below displays a basic plot of the indefinite integral of the function dxdt, with the constant chosen so that the curve passes through 0 at x = 0. The tuple passed to Plotting() sets the range of x values to plot; the number of data points between the minimum and maximum is set by the linspace call inside the function (500 here).
For more information on customizing the plot, I recommend the matplotlib documentation. Documentation on integration can be found in the SymPy documentation.
from sympy import integrate, lambdify
from sympy.abc import x
import matplotlib.pyplot as plt
import numpy as np

def Plotting(xValues, dxdt):
    # Calculate the indefinite integral of dxdt with respect to x
    xt = integrate(dxdt, x)
    # Convert the symbolic result to a numerical function
    f = lambdify(x, xt)
    # Constant chosen so the curve passes through 0 at x = 0
    C = -f(0)
    # Define the x values; the last number in linspace is the number of points to plot
    xValues = np.linspace(xValues[0], xValues[1], 500)
    yValues = [f(v) + C for v in xValues]
    # Initialize the figure
    fig = plt.figure(figsize=(4, 3))
    ax = fig.add_axes([0, 0, 1, 1])
    # Plot the data
    ax.plot(xValues, yValues)
    plt.show()
    plt.close("all")

# Define the function dx/dt
dxdt = -(1.0*(6.84e45*x**2 + 5.24e32*x - 2.49e42))/(2.47e39*x + 7.12e37)

# Run the Plotting function, with the left- and right-most points as a tuple
# and the function as the second argument
Plotting((-0.025, 0.05), dxdt)
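For comparison, the initial-value problem can also be integrated numerically in Python, much like the ode45 answer above. This is only a sketch: the function name rhs is made up here, method='LSODA' is chosen because the right-hand side looks stiff, and the time span [0, 10] simply mirrors the MATLAB example.
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import solve_ivp

def rhs(t, xv):
    return -(1.0*(6.84e45*xv**2 + 5.24e32*xv - 2.49e42))/(2.47e39*xv + 7.12e37)

# x(0) = 0; LSODA switches to a stiff solver automatically when needed
sol = solve_ivp(rhs, (0, 10), [0.0], method='LSODA', dense_output=True)
ts = np.linspace(0, 10, 1000)
plt.plot(ts, sol.sol(ts)[0], 'b')
plt.xlabel('t')
plt.ylabel('x(t)')
plt.show()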

How to calculate cosine similarity between two frequency vectors in MATLAB?

I need to find the cosine similarity between two frequency vectors in MATLAB.
Example vectors:
a = [2,3,4,4,6,1]
b = [1,3,2,4,6,3]
How do I measure the cosine similarity between these vectors in MATLAB?
Take a quick look at the mathematical definition of cosine similarity.
From the definition, you just need the dot product of the vectors divided by the product of their Euclidean norms: cosSim = (a·b) / (‖a‖ ‖b‖).
% MATLAB 2018b
a = [2,3,4,4,6,1];
b = [1,3,2,4,6,3];
cosSim = sum(a.*b)/sqrt(sum(a.^2)*sum(b.^2)); % 0.9436
Alternatively, you could use
cosSim = (a(:).'*b(:))/sqrt(sum(a.^2)*sum(b.^2)); % 0.9436
which gives the same result.
After reading this correct answer, to avoid sending you to another castle I've added another approach using MATLAB's built-in linear algebra functions, dot() and norm().
cosSim = dot(a,b)/(norm(a)*norm(b)); % 0.9436
See also the tag-wiki for cosine-similarity.
Performance by approach (benchmark plot omitted; each point was the geometric mean of the computation times for 10 randomly generated vectors):
sum(a.*b)/sqrt(sum(a.^2)*sum(b.^2))
(a(:).'*b(:))/sqrt(sum(a.^2)*sum(b.^2))
dot(a,b)/(norm(a)*norm(b))
If you have the Statistics toolbox, you can use the pdist2 function with the 'cosine' input flag, which gives 1 minus the cosine similarity:
a = [2,3,4,4,6,1];
b = [1,3,2,4,6,3];
result = 1-pdist2(a, b, 'cosine');

Eigenvalues of a Laplacian in NetworkX

NetworkX has a decent code example for getting all the eigenvalues of a Laplacian matrix, given below:
import matplotlib.pyplot as plt
import networkx as nx
import numpy.linalg
n = 1000 # 1000 nodes
m = 5000 # 5000 edges
G = nx.gnm_random_graph(n, m)
L = nx.normalized_laplacian_matrix(G)
e = numpy.linalg.eigvals(L.A)
print("Largest eigenvalue:", max(e))
print("Smallest eigenvalue:", min(e))
plt.hist(e, bins=100) # histogram with 100 bins
plt.xlim(0, 2) # eigenvalues between 0 and 2
plt.show()
For the most part I follow all of this until you hit numpy.linalg.eigvals(L.A). What's the .A bit doing? I've looked at the documentation for sparse matrices in SciPy, but I can't find a reference to this.
L.A is shorthand for L.toarray(): it gives the dense ndarray representation of the sparse matrix object.
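As a quick check of what the conversion does (a small sketch; note that newer NetworkX/SciPy versions may return a sparse array instead of a sparse matrix, in which case toarray() is the supported spelling and the .A shorthand may not be available):
import networkx as nx
import numpy as np

G = nx.path_graph(4)
L = nx.normalized_laplacian_matrix(G)  # SciPy sparse object
dense = L.toarray()                    # the same dense ndarray that L.A exposes on sparse matrices
e = np.linalg.eigvals(dense)           # eigvals needs a dense array, hence the conversion
print(dense.shape)                     # (4, 4) plain ndarray
print(sorted(e.real))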

Solving and Plot Equation in Python

I am kind of new to Python. All I am trying to do is to solve for y and plot the function
y^10 + y = x.
In other words, plug in values for x and generate y.
Please forgive my ignorance.
from numpy import *
from matplotlib.pyplot import plot, show
y = arange(-10, 10, 0.01) #get values between -10 and 10 with 0.01 step and set to y
x = y**10 + y #get x values from y
plot(x, y)
show()
This uses the numpy and matplotlib libraries: http://scipy.org/
If you want to solve things, use sympy: https://github.com/sympy/sympy/wiki/Quick-examples
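If you really do want to solve y^10 + y = x for y at given values of x (rather than generating x from y as above), here is a minimal SymPy sketch. It uses nsolve, warm-starting each solve from the previous root so it follows the real branch through y = 0; the x range of 0 to 5 is an arbitrary choice for illustration.
import numpy as np
import matplotlib.pyplot as plt
from sympy import symbols, nsolve

y = symbols('y', real=True)

xs = np.linspace(0, 5, 101)
ys = []
guess = 0.0
for xv in xs:
    # Solve y**10 + y - x = 0 numerically, starting from the previous root
    guess = float(nsolve(y**10 + y - float(xv), y, guess))
    ys.append(guess)

plt.plot(xs, ys)
plt.xlabel('x')
plt.ylabel('y')
plt.show()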