Euclidean distance in spark 2.1 - scala

I'm trying to calculate the Euclidean distance between two vectors. I have the following dataframe:
root
|-- h: string (nullable = true)
|-- id: string (nullable = true)
|-- sid: string (nullable = true)
|-- features: vector (nullable = true)
|-- episodeFeatures: vector (nullable = true)
import org.apache.spark.mllib.util.{MLUtils}
val jP2 = jP.withColumn("dist", MLUtils.fastSquaredDistance("features", 5, "episodeFeatures", 5))
I get an error like so:
error: method fastSquaredDistance in object MLUtils cannot be accessed in object org.apache.spark.mllib.util.MLUtils
Is there a way to access that private method?

MLUtils is an internal package, and even if it weren't, fastSquaredDistance couldn't be used on Columns or (guessing from the version) on ml vectors. You have to design your own udf:
import org.apache.spark.sql.functions._
import org.apache.spark.ml.linalg.Vector
val euclidean = udf((v1: Vector, v2: Vector) => ???) // Fill with preferred logic
val jP2 = jP.withColumn("dist", euclidean($"features", $"episodeFeatures"))
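For completeness, a minimal sketch of one way to fill in that udf, assuming both columns hold org.apache.spark.ml.linalg.Vector values of the same length:
import org.apache.spark.sql.functions.udf
import org.apache.spark.ml.linalg.{Vector, Vectors}
// Vectors.sqdist gives the squared distance; its square root is the Euclidean distance
val euclidean = udf((v1: Vector, v2: Vector) => math.sqrt(Vectors.sqdist(v1, v2)))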

Related

scala: read csv of documents, create cosine similarity

I'm reading in dozens of documents. They seem to be read into both RDDs and DataFrames as strings spread across many columns:
This is the schema:
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)...
|-- _c58: string (nullable = true)
|-- _c59: string (nullable = true)
This is the head of the df:
|          _c1|          _c2| ...
|           V1|           V2| ...
|This text ...|This is an...| ...
I'm trying to create a cosine similarity matrix using this:
val contentRDD = spark.sparkContext.textFile("...documents_vector.csv").toDF()
val Row(coeff0: Matrix) = Correlation.corr(contentRDD, "features").head
println(s"Pearson correlation matrix:\n $coeff0")
This is another way I was doing it:
val df_4 = spark.read.csv("/document_vector.csv")
Here, features would be the name of the column created by converting the single row of 59 columns into a single column of 59 rows.
Is there a way to map each new element in the csv to a new row to complete the cosine similarity matrix? Is there another way I should be doing this?
Thank you to any who consider this.
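A minimal sketch of one common way to get from the raw CSV to the Correlation.corr call above, assuming the columns actually hold numeric values that can be cast to double (the path comes from the question; raw, numeric, assembler and vectors are illustrative names):
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
// Read the CSV and cast every column to double
val raw = spark.read.csv("/document_vector.csv")
val numeric = raw.columns.foldLeft(raw)((df, c) => df.withColumn(c, col(c).cast("double")))
// Collapse all the columns into a single vector column named "features"
val assembler = new VectorAssembler().setInputCols(numeric.columns).setOutputCol("features")
val vectors = assembler.transform(numeric)
val Row(coeff0: Matrix) = Correlation.corr(vectors, "features").head
println(s"Pearson correlation matrix:\n $coeff0")
Note that Correlation.corr produces a Pearson correlation matrix, as in the snippet in the question, rather than a cosine similarity matrix.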

apache spark add column which is a complex calculation

I have a following dataset df1 in Spark:
root
|-- id: integer (nullable = true)
|-- t: string (nullable = true)
|-- x: double (nullable = false)
|-- y: double (nullable = false)
|-- z: double (nullable = false)
and I need to create a column that holds the result of a calculation like
sqrt(x)+cqrt(y)+z*constantK
I'm trying something like following:
val constantK=100500
val df2= df1.select($"id", (scala.math.sqrt($"x")+scala.math.cqrt($"y")+$"z"*constantK ))
however, I got a type mismatch error
<console>:59: error: type mismatch;
found : org.apache.spark.sql.ColumnName
required: Double
val df2= df1.select($"id", (scala.math.sqrt($"x")+scala.math.cqrt($"y")+$"z"*constantK ))
what is the proper way of adding columns with complex calculations which are based on the values of other columns in the dataframe?
That's because you are trying to use scala.math functions on Spark SQL Columns. Spark SQL has its own operations and types:
import org.apache.spark.sql.functions.sqrt
df1.select($"id", (sqrt($"x")+sqrt($"y")+$"z"*constantK ))
The '*' operator is supported. Take a look at https://spark.apache.org/docs/2.3.0/api/sql/index.html
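A slightly fuller sketch, assuming cqrt in the question was meant as a cube root (Spark SQL exposes that as cbrt); withColumn is just one way to attach the result:
import org.apache.spark.sql.functions.{sqrt, cbrt, col}
val constantK = 100500
// sqrt, cbrt and * all operate on Columns, so the whole expression stays inside Spark SQL
val df2 = df1.withColumn("result", sqrt(col("x")) + cbrt(col("y")) + col("z") * constantK)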

How is the VectorAssembler used with Spark's Correlation util?

I'm trying to correlate a couple of columns of a dataframe in Spark Scala by piping the columns of the original dataframe into the VectorAssembler followed by the Correlation util. For some reason the VectorAssembler seems to be producing empty vectors, as seen below. Here's what I have so far.
val numericalCols = Array(
"price", "bedrooms", "bathrooms",
"sqft_living", "sqft_lot"
)
val data: DataFrame = HousingDataReader(spark)
data.printSchema()
/*
...
|-- price: decimal(38,18) (nullable = true)
|-- bedrooms: decimal(38,18) (nullable = true)
|-- bathrooms: decimal(38,18) (nullable = true)
|-- sqft_living: decimal(38,18) (nullable = true)
|-- sqft_lot: decimal(38,18) (nullable = true)
...
*/
println("total record:"+data.count()) //total record:21613
val assembler = new VectorAssembler().setInputCols(numericalCols)
.setOutputCol("features").setHandleInvalid("skip")
val df = assembler.transform(data).select("features","price")
df.printSchema()
/*
|-- features: vector (nullable = true)
|-- price: decimal(38,18) (nullable = true)
*/
df.show
/* THIS IS ODD
+--------+-----+
|features|price|
+--------+-----+
+--------+-----+
*/
println("df row count:" + df.count())
// df row count:21613
val Row(coeff1: Matrix) = Correlation.corr(df, "features").head //ERROR HERE
println("Pearson correlation matrix:\n" + coeff1.toString)
this ends up with the following exception
java.lang.RuntimeException: Cannot determine the number of cols because it is not specified in the constructor and the rows RDD is empty.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.numCols(RowMatrix.scala:64)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:345)
at org.apache.spark.mllib.stat.correlation.PearsonCorrelation$.computeCorrelationMatrix(PearsonCorrelation.scala:49)
at org.apache.spark.mllib.stat.correlation.Correlations$.corrMatrix(Correlation.scala:66)
at org.apache.spark.mllib.stat.Statistics$.corr(Statistics.scala:74)
at org.apache.spark.ml.stat.Correlation$.corr(Correlation.scala:73)
at org.apache.spark.ml.stat.Correlation$.corr(Correlation.scala:84)
at
It looks like at least one of your feature columns always contains a null value. setHandleInvalid("skip") will skip any row that contains a null in one of the features. Try filling the null values with fillna(0) and check the result; that should solve your issue.
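A minimal sketch of that suggestion in Scala (na.fill is the DataFrame equivalent of fillna; casting the decimal columns to double first keeps the fill straightforward, and numericalCols, data and assembler come from the question):
import org.apache.spark.sql.functions.col
// Cast the decimal feature columns to double, then replace any nulls with 0.0
val casted = numericalCols.foldLeft(data)((df, c) => df.withColumn(c, col(c).cast("double")))
val cleaned = casted.na.fill(0.0, numericalCols)
val assembled = assembler.transform(cleaned).select("features", "price")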

How to get the first row data of each list?

My DataFrame looks like this:
+------------------------+----------------------------------------+
|ID |probability |
+------------------------+----------------------------------------+
|583190715ccb64f503a|[0.49128147201958017,0.5087185279804199]|
|58326da75fc764ad200|[0.42143416087939345,0.5785658391206066]|
|583270ff17c76455610|[0.3949217100212508,0.6050782899787492] |
|583287c97ec7641b2d4|[0.4965059792664432,0.5034940207335569] |
|5832d7e279c764f52e4|[0.49128147201958017,0.5087185279804199]|
|5832e5023ec76406760|[0.4775830044196701,0.52241699558033] |
|5832f88859cb64960ea|[0.4360509428173421,0.563949057182658] |
|58332e6238c7643e6a7|[0.48730029128352853,0.5126997087164714]|
and I get the probability column using
val proVal = Data.select("probability").rdd.map(r => r(0)).collect()
proVal.foreach(println)
the result is :
[0.49128147201958017,0.5087185279804199]
[0.42143416087939345,0.5785658391206066]
[0.3949217100212508,0.6050782899787492]
[0.4965059792664432,0.5034940207335569]
[0.49128147201958017,0.5087185279804199]
[0.4775830044196701,0.52241699558033]
[0.4360509428173421,0.563949057182658]
[0.48730029128352853,0.5126997087164714]
but I want to get just the first value of each row, like this:
0.49128147201958017
0.42143416087939345
0.3949217100212508
0.4965059792664432
0.49128147201958017
0.4775830044196701
0.4360509428173421
0.48730029128352853
how can this be done?
The input is standard random forest output; the Data above comes from val Data = predictions.select("docID", "probability").
predictions.printSchema()
root
|-- docID: string (nullable = true)
|-- label: double (nullable = false)
|-- features: vector (nullable = true)
|-- indexedLabel: double (nullable = true)
|-- rawPrediction: vector (nullable = true)
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = true)
|-- predictedLabel: string (nullable = true)
and I want to get the first value of the "probability" column
You can use the Column.apply method to get the n-th item of an array column - in this case the first element (using index 0):
import sqlContext.implicits._
val proVal = Data.select($"probability"(0)).rdd.map(r => r(0)).collect()
BTW, if you're using Spark 1.6 or higher, you can also use the Dataset API for a cleaner way to convert the dataframe into Doubles:
val proVal = Data.select($"probability"(0)).as[Double].collect()
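If probability is an ml Vector column (as the printed schema suggests) rather than an array, and direct indexing raises an analysis error, one alternative sketch is a small udf that pulls out the first element:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf
// Extract element 0 of the probability vector in each row
val firstProb = udf((v: Vector) => v(0))
// assumes spark.implicits._ (or sqlContext.implicits._ as above) is in scope for $ and as[Double]
val proVal = Data.select(firstProb($"probability")).as[Double].collect()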

how do I cast field from double to float and round it using pyspark

I have two dataframes with the schema as below:
books_with_10_ratings_or_more_with_title:
root
|-- ISBN: string (nullable = true)
|-- count: long (nullable = false)
|-- average: double (nullable = true)
and
books_df:
root
|-- ISBN: string (nullable = true)
|-- count: long (nullable = false)
|-- average: double (nullable = true)
I tried to join them together and change the rating (i.e. average) to float:
books_with_10_ratings_or_more_with_title = books_with_10_ratings_or_more.join(books_df, 'ISBN').select('ISBN', 'Book-Title', 'Book-Author', 'Year', books_with_10_ratings_or_more.average.cast(float))
so that I can round it, but it throws an error:
unexpected type:
What's wrong with the code and how do I fix it? Thank you very much.
You can either do
books_with_10_ratings_or_more.average.cast('float')
or
from pyspark.sql.types import FloatType
books_with_10_ratings_or_more.average.cast(FloatType())
There is an example in the official API doc
EDIT
So you tried to cast because round complained about something not being a float. You don't have to cast: rounding to three digits makes no practical difference between FloatType and DoubleType.
Your round won't work because you are using Python's built-in function. You need to import it from pyspark.sql.functions. For example:
from pyspark.sql.types import Row
from pyspark.sql.functions import col, round
df = sc.parallelize([
Row(isbn=1, count=1, average=10.6666666),
Row(isbn=2, count=1, average=11.1111111)
]).toDF()
df.select(round(col('average'), 3).alias('average')).collect()