Pyspark euclidean distance between entry and column - pyspark

I am working with pyspark, and I am wondering if there is any smart way to get the Euclidean distance between one row's array entry and the whole column. For instance, there is a dataset like this.
+--------------------+---+
| features| id|
+--------------------+---+
|[0,1,2,3,4,5 ...| 0|
|[0,1,2,3,4,5 ...| 1|
|[1,2,3,6,7,8 ...| 2|
Choose one of the entries, e.g. id == 1, and calculate the Euclidean distance between it and every row of the features column. In this case, the result should be [0, 0, sqrt(1+1+1+9+9+9)].
Can anybody figure out how to do this efficiently?
Thanks!

If you want the Euclidean distance between a fixed entry and a whole column, simply do this:
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType
from scipy.spatial import distance
fixed_entry = [0,3,2,7...] #for example, the entry against which you want distances
distance_udf = F.udf(lambda x: float(distance.euclidean(x, fixed_entry)), FloatType())
df = df.withColumn('distances', distance_udf(F.col('features')))
Your df will have a column of distances.
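If the reference entry itself lives in the dataframe (as in the question, id == 1), here is a small sketch of how you could pull it out before building the UDF (assuming features is a plain array column):
# collect the reference row once on the driver and reuse it as fixed_entry above
fixed_entry = df.where(F.col('id') == 1).first()['features']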

You can use BucketedRandomProjectionLSH [1] to get pairwise distances across your data frame via an approximate self-join.
from pyspark.ml.feature import BucketedRandomProjectionLSH
brp = BucketedRandomProjectionLSH(
inputCol="features", outputCol="hashes", seed=12345, bucketLength=1.0
)
model = brp.fit(df)
model.approxSimilarityJoin(df, df, 3.0, distCol="EuclideanDistance")
You can also get the distances from one row to the whole column with approxNearestNeighbors [2], but the results are limited by numNearestNeighbors, so you could pass it the count of the entire data frame.
one_row = df.where(df.id == 1).first().features
model.approxNearestNeighbors(df, one_row, df.count()).collect()
Also, make sure to convert your data to Vectors!
from pyspark.sql import functions as F
from pyspark.ml.linalg import Vectors, VectorUDT
to_dense_vector = F.udf(Vectors.dense, VectorUDT())
df = df.withColumn('features', to_dense_vector('features'))
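To pull the distances from the row with id == 1 to every other row out of the similarity join, a small sketch (the datasetA/datasetB struct columns are what approxSimilarityJoin produces; note that pairs farther apart than the 3.0 threshold are dropped):
joined = model.approxSimilarityJoin(df, df, 3.0, distCol="EuclideanDistance")
joined.filter(F.col("datasetA.id") == 1) \
      .select(F.col("datasetB.id").alias("id"), "EuclideanDistance") \
      .show()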
[1] https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html?highlight=approx#pyspark.ml.feature.BucketedRandomProjectionLSH
[2] https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html?highlight=approx#pyspark.ml.feature.BucketedRandomProjectionLSHModel.approxNearestNeighbors

Here is an implementation that uses the SQL function power() to compute the Euclidean distance between matching rows in two dataframes:
cols2Join = ['Key1','Key2']
colsFeature =['Feature1','Feature2','Feature3','Feature4']
columns = cols2Join + colsFeature
valuesA = [('key1value1','key2value1',111,22,33,.334),('key1value3','key2value3', 333,444,12,.445),('key1value5','key2value5',555,666,101,.99),('key1value7','key2value7',777,888,10,.019)]
table1 = spark.createDataFrame(valuesA,columns)
valuesB = [('key1value1','key2value1',22,33,3,.1),('key1value3','key2value3', 88,99,4,1.23),('key1value5','key2value5',4,44,1,.998),('key1value7','key2value7',9,99,1,.3)]
table2 = spark.createDataFrame(valuesB,columns)
# Build the SQL expression with a list comprehension; the SQL function power() computes the Euclidean distance inline
beginExpr='power(('
InnerExpr = ['power((a.{}-b.{}),2)'.format(x,x) for x in colsFeature]
InnerExpr = '+'.join(str(e) for e in InnerExpr)
endExpr ='),0.5) AS EuclideanDistance'
distanceExpr = beginExpr + InnerExpr + endExpr
Expr = cols2Join+ [distanceExpr]
# now just join the tables and use selectExpr to get the Euclidean distance
outDF = table1.alias('a').join(table2.alias('b'),cols2Join,how="inner").selectExpr(Expr)
display(outDF)  # display() is Databricks-specific; use outDF.show() elsewhere
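For reference, with the four feature columns above the list comprehension builds the following expression (printed here only to make the inline math explicit):
print(distanceExpr)
# power((power((a.Feature1-b.Feature1),2)+power((a.Feature2-b.Feature2),2)+power((a.Feature3-b.Feature3),2)+power((a.Feature4-b.Feature4),2)),0.5) AS EuclideanDistance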

If you need to find the Euclidean distances between only one particular row and every other row in the dataframe, you can filter and collect that row and pass it to a UDF.
But if you need to calculate the distances between all pairs, you need to use a join.
Repartition the dataframe by id; it will speed up the join operation. There is no need to calculate the full pairwise matrix: just calculate the upper or lower half and replicate it. I wrote a function for myself based on this logic.
from pyspark.sql import functions as f
from pyspark.sql import types as t

df = df.repartition("id")
df.cache()
df.show()
# metric = any callable that computes the distance between two vectors
def pairwise_metric(Y, metric, col_name="metric"):
    Y2 = Y.select(f.col("id").alias("id2"),
                  f.col("features").alias("features2"))

    # join to create the lower (or upper) half of the pairwise matrix
    Y = Y.join(Y2, Y.id < Y2.id2, "inner")

    def sort_list(x):
        x = sorted(x, key=lambda y: y[0])
        x = list(map(lambda y: y[1], x))
        return x

    udf_diff = f.udf(lambda x, y: metric(x, y), t.FloatType())
    udf_sort = f.udf(sort_list, t.ArrayType(t.FloatType()))

    # diagonal of the distance matrix: distance of each id to itself is 0
    Yid = Y2.select("id2").distinct().select("id2",
          f.col("id2").alias("id")).withColumn("dist", f.lit(0.0))

    Y = Y.withColumn("dist", udf_diff("features",
        "features2")).drop("features", "features2")

    # just swap the column names and take the union to get the other half
    Y = Y.union(Y.select(f.col("id2").alias("id"),
                         f.col("id").alias("id2"), "dist"))

    # union for the diagonal elements of the distance matrix
    Y = Y.union(Yid)

    st1 = f.struct(["id2", "dist"]).alias("vals")

    # group by id, aggregate and sort each row's distances by id2
    Y = (Y.select("id", st1).groupBy("id").agg(f.collect_list("vals")
         .alias("vals")).withColumn("dist", udf_sort("vals")).drop("vals"))

    return Y.select(f.col("id").alias("id1"), f.col("dist").alias(col_name))
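A quick usage sketch (the euclidean callable below is just an illustrative metric; any function that takes two vectors and returns a float will do):
import math

def euclidean(a, b):
    # plain Euclidean distance between two equal-length sequences
    return float(math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b))))

dist_df = pairwise_metric(df, euclidean, col_name="euclidean")
dist_df.show(truncate=False)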

Related

Hierarchical Agglomerative clustering for Spark

I am working on a project using Spark and Scala and I am looking for a hierarchical clustering algorithm, similar to scipy.cluster.hierarchy.fcluster or sklearn.cluster.AgglomerativeClustering, that is usable for large amounts of data.
Spark MLlib implements bisecting k-means, which needs the number of clusters as input. Unfortunately, in my case I don't know the number of clusters and I would prefer to use some distance threshold as an input parameter, as is possible in the two Python implementations above.
If anyone knows the answer, I would be very grateful.
So I had the same problem and after looking high and low found no answers so I will post what I did here in the hopes that it helps anyone else and that maybe someone will build on it.
The basic idea of what I did was to use bisecting k-means recursively to keep splitting clusters in half until every point in a cluster is within a specified distance of the centroid. I was using GPS data, so I have a little bit of extra machinery to deal with that.
The first step is to create a model that cuts the data in half. I used bisecting k-means, but I think this would work with any of the pyspark clustering methods, as long as you can get the distance to the centroid.
import pyspark.sql.functions as f
from pyspark import SparkContext, SQLContext
from pyspark.sql.types import FloatType
from pyspark.sql.window import Window
from pyspark.ml.clustering import BisectingKMeans
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

bkm = BisectingKMeans().setK(2).setSeed(1)
assembler = VectorAssembler(inputCols=['lat','long'], outputCol="features")
adf = assembler.transform(locAggDf)  # locAggDf contains my location info
model = bkm.fit(adf)
# predictions will have the original data plus a "prediction" column with the assigned cluster number
predictions = model.transform(adf)
predictions.persist()
The next step is our recursive function. The idea is that we specify some distance from the centroid, and if any point in a cluster is farther away than that distance, we cut the cluster in half. When a cluster is tight enough to meet the condition, I add it to a result array that I use to build the final clustering.
def bisectToDist(model, predictions, bkm, precision, result=[]):
    centers = model.clusterCenters()
    # row[0] is the predicted cluster num, row[1] is unit, row[2] is the point lat, row[3] is the point long
    # centers[row[0]] is the lat/long of the center: centers[row[0]][0] = lat, centers[row[0]][1] = long
    distUdf = f.udf(
        lambda row: getDistWrapper((centers[row[0]][0], centers[row[0]][1], row[1]), (row[2], row[3], row[1])),
        FloatType())  # getDistWrapper is how I calculate the lat/long distance, but you can use any distance metric
    predictions = predictions.withColumn('dist', distUdf(
        f.struct(predictions.prediction, predictions.encodedPrecisionUnit, predictions.lat, predictions.long)))

    # create a df of all rows that were in clusters that had a point outside of the threshold
    toBig = predictions.join(
        predictions.groupby('prediction').agg({"dist": "max"}).filter(f.col('max(dist)') > precision).select(
            'prediction'), ['prediction'], 'leftsemi')

    # this could probably be improved
    # get all cluster numbers that were too big
    listids = toBig.select("prediction").distinct().rdd.flatMap(lambda x: x).collect()

    # if all data points are within the specified distance of the centroid we can return the clustering
    if len(listids) == 0:
        return predictions

    # assuming binary clustering, k must be 2
    # if one of the two clusters was small enough there will be no further recursive call for it,
    # so we must save it and return it at this depth; the cluster that was too big is cut in half below
    if len(listids) == 1:
        ok = predictions.join(
            predictions.groupby('prediction').agg({"dist": "max"}).filter(
                f.col('max(dist)') <= precision).select(
                'prediction'), ['prediction'], 'leftsemi')

    for clusterId in listids:
        # get all of the pieces that were too big
        part = toBig.filter(toBig.prediction == clusterId)
        # we now need to refit the subset of the data
        assembler = VectorAssembler(inputCols=['lat', 'long'], outputCol="features")
        adf = assembler.transform(part.drop('prediction').drop('features').drop('dist'))
        model = bkm.fit(adf)
        # predictions now holds the new subclustering and we are ready for recursion
        predictions = model.transform(adf)
        result.append(bisectToDist(model, predictions, bkm, precision, result=result))

    # return anything that was given and already good
    if len(listids) == 1:
        return ok
Finally we can call the function and build the resulting dataframe
result = []
# precision = your distance threshold (same value used inside bisectToDist)
bisectToDist(model, predictions, bkm, precision, result=result)

# drop any Nones; they can happen in recursive (non top-level) calls
result = [r for r in result if r]

r = result[0]
r = r.withColumn('subIdx', f.lit(0))
result = result[1:]
idx = 1
for r1 in result:
    r1 = r1.withColumn('subIdx', f.lit(idx))
    r = r.unionByName(r1)
    idx = idx + 1

# each of the subclusters has a 0 or 1 prediction; to turn that into 0..n cluster ids I added the following
r = r.withColumn('delta', r.subIdx * 100 + r.prediction)
r = r.withColumn('delta', r.delta - f.lag(r.delta, 1).over(Window.orderBy("delta"))).fillna(0)
r = r.withColumn('ddelta', f.when(r.delta != 0, 1).otherwise(0))
r = r.withColumn('spacialLocNum', f.sum('ddelta').over(Window.orderBy(['subIdx', 'prediction'])))
# spacialLocNum should be the final clustering
Admittedly this is quite convoluted and slow, but it does get the job done. Hope this helps!

calculate the z score for each row in the column of a dataframe using scala / spark

I have a dataframe:
val DF = {spark.read.option("header", value = true).option("delimiter", ";").csv(path_file)}
val cord = DF.select("time","longitude", "latitude","speed")
I want to calculate the z-score (x - mean)/std for each row of the speed column. I calculated the mean and standard deviation:
val std = DF.select(col("speed").cast("double")).as[Double].rdd.stdev()
val mean = DF.select(col("speed").cast("double")).as[Double].rdd.mean()
How can I calculate the z-score for each row of the speed column and obtain this result:
+----------------+-----------+-----+--------+
|A               |B          |speed|z score |
+----------------+-----------+-----+--------+
|17/02/2020 00:06|-7.1732833 |50   |z score |
|17/02/2020 00:16|-7.1732833 |40   |z score |
|17/02/2020 00:26|-7.1732833 |30   |z score |
+----------------+-----------+-----+--------+
The simplest way to perform this is (note the parentheses, so the mean is subtracted before dividing by the standard deviation):
df.withColumn("z score", (col("speed") - mean) / std)
where mean and std are calculated as shown in the question.
Let me know if this helps!!
You can avoid two separate RDD actions if you use window functions and STDDEV_POP from the Hive aggregate functions:
val DF = {spark.read.option("header", value = true).option("delimiter", ";").csv(path_file)}
val cord = DF.select($"time",$"longitude", $"latitude",$"speed".cast("double"))
val result = cord
.withColumn("mean",avg($"speed").over())
.withColumn("stddev",callUDF("stddev_pop",$"speed").over())
.withColumn("z-score",($"speed"-$"mean")/$"stddev")

how can i plot a histogram in pyspark

I am new to PySpark. I have a table as below, and I want to plot a histogram of this df: the x axis should show the “word” column and the y axis the “count” column. Do you have any idea?
word count
Akdeniz’in 14
en 13287
büyük 3168
deniz 1276
festivali: 6
First of all, a histogram is not the correct chart type to visualize a word count. Histograms are useful for visualizing the distribution of a variable, whereas bar charts are used to compare variables. With the following code you can create a bar chart for your example:
from matplotlib import pyplot
l = [( 'Akdeniz’in', 14)
,('en' , 13287)
,('büyük' , 3168)
,('deniz' , 1276)
,('festivali:' , 6)]
df = spark.createDataFrame(l,['word','count'])
#Add values to a list (not recommended when you have a huge dataframe)
bla = df.collect()
#create a numeric value for every label
indexes = list(range(len(bla)))
#split words and counts to different lists
values = [r['count'] for r in bla]
labels = [r['word'] for r in bla]
#Plotting
pyplot.bar(indexes, values)
#add labels under the bars (bar() centre-aligns the bars on the x positions by default)
pyplot.xticks(indexes, labels)
pyplot.show()
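For larger dataframes, here is a sketch of an alternative (assuming pandas is available): keep the heavy lifting in Spark and only convert a small, already-limited result for plotting.
from pyspark.sql import functions as F

# take only the top-N words on the Spark side, then convert the small result to pandas
pdf = df.orderBy(F.col('count').desc()).limit(20).toPandas()
pdf.plot.bar(x='word', y='count', legend=False)
pyplot.show()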

Computing distance of a vector from the center of K-means cluster

I have a training data set and I ran K-means on it with K=4, getting four cluster centers. For new data point(s), I would like to know not only the predicted cluster but also the distance from that cluster's center. Is there an API to compute the Euclidean distance from the center? I can make two API calls if that is needed. I am using Scala and I couldn't find any examples anywhere.
Since Spark 2.0, Vectors.sqdist can be used to calculate the squared distance between two Vectors.
You could use a UDF to calculate, for every point, the distance from its cluster center, like so:
import org.apache.spark.ml.linalg.{Vectors, Vector}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.functions.udf
// Sample points
val points = Seq(Vectors.dense(1,0), Vectors.dense(2,-3), Vectors.dense(0.5, -1), Vectors.dense(1.5, -1.5))
val df = points.map(Tuple1.apply).toDF("features")
// K-means
val kmeans = new KMeans()
.setFeaturesCol("features")
.setK(2)
val kmeansModel = kmeans.fit(df)
val predictedDF = kmeansModel.transform(df)
// predictedDF.schema = (features: Vector, prediction: Int)
// Cluster Centers
kmeansModel.clusterCenters foreach println
/*
[1.75,-2.25]
[0.75,-0.5]
*/
// UDF that calculates for each point distance from each cluster center
val distFromCenter = udf((features: Vector, c: Int) => Vectors.sqdist(features, kmeansModel.clusterCenters(c)))
val distancesDF = predictedDF.withColumn("distanceFromCenter", distFromCenter($"features", $"prediction"))
distancesDF.show(false)
/*
+----------+----------+------------------+
|features |prediction|distanceFromCenter|
+----------+----------+------------------+
|[1.0,0.0] |1 |0.3125 |
|[2.0,-3.0]|0 |0.625 |
|[0.5,-1.0]|1 |0.3125 |
|[1.5,-1.5]|0 |0.625 |
+----------+----------+------------------+
*/
NOTE: Vectors.sqdist calculates the squared distance between two Vectors (no square root). If you specifically need the Euclidean distance, wrap it: Math.sqrt(Vectors.sqdist(...)).
The following worked for me ...
def EuclideanDistance(xs: Array[Double], ys: Array[Double]): Double = {
  scala.math.sqrt((xs zip ys).map { case (x, y) => scala.math.pow(y - x, 2.0) }.sum)
}

Spark: How to run logistic regression using only some features from LabeledPoint?

I have a LabeledPoint on which I want to run logistic regression:
Data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] =
MapPartitionsRDD[3335] at map at <console>:44
using code:
val splits = Data.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val numIterations = 100
val model = LogisticRegressionWithSGD.train(training, numIterations)
My problem is that I don't want to use all of the features from the LabeledPoint, only some of them. I've got a list of features that I want to use, for example:
LoF=List(223244,334453...
How can I get only the features that I want to use from the LabeledPoint, or select them in the logistic regression?
Feature selection allows selecting the most relevant features for use in model construction. Feature selection reduces the size of the vector space and, in turn, the complexity of any subsequent operation with vectors. The number of features to select can be tuned using a held-out validation set.
One way to do what you are seeking is using the ElementwiseProduct.
ElementwiseProduct multiplies each input vector by a provided “weight” vector, using element-wise multiplication. In other words, it scales each column of the dataset by a scalar multiplier. This represents the Hadamard product between the input vector, v and transforming vector, w, to yield a result vector.
So if we set the weights of the features we want to keep to 1.0 and the others to 0.0, the ElementwiseProduct of the original vector and this 0-1 weight vector will select exactly the features we need:
import org.apache.spark.mllib.feature.ElementwiseProduct
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// Creating dummy LabeledPoint RDD
val data = sc.parallelize(Array(LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0,5.0,1.0)), LabeledPoint(1.0,Vectors.dense(4.0, 5.0, 6.0,1.0,2.0)),LabeledPoint(0.0,Vectors.dense(4.0, 2.0, 3.0,0.0,2.0))))
data.toDF.show
// +-----+--------------------+
// |label| features|
// +-----+--------------------+
// | 1.0|[1.0,0.0,3.0,5.0,...|
// | 1.0|[4.0,5.0,6.0,1.0,...|
// | 0.0|[4.0,2.0,3.0,0.0,...|
// +-----+--------------------+
// You'll need to know how many features you have, I have used 5 for the example
val numFeatures = 5
// The indices represent the features we want to keep
// Note : indices start with 0 so actually here you are keeping features 4 and 5.
val indices = List(3, 4).toArray
// Now we can create our weights vectors
val weights = Array.fill[Double](indices.size)(1)
// Create the sparse vector of the features we need to keep.
val transformingVector = Vectors.sparse(numFeatures, indices, weights)
// Init our vector transformer
val transformer = new ElementwiseProduct(transformingVector)
// Apply it to the data.
val transformedData = data.map(x => LabeledPoint(x.label,transformer.transform(x.features).toSparse))
transformedData.toDF.show
// +-----+-------------------+
// |label| features|
// +-----+-------------------+
// | 1.0|(5,[3,4],[5.0,1.0])|
// | 1.0|(5,[3,4],[1.0,2.0])|
// | 0.0| (5,[4],[2.0])|
// +-----+-------------------+
Note: I used the sparse vector representation for space optimization; the resulting features are sparse vectors.