Apache Spark: add a column which is a complex calculation - Scala

I have the following dataset df1 in Spark:
root
|-- id: integer (nullable = true)
|-- t: string (nullable = true)
|-- x: double (nullable = false)
|-- y: double (nullable = false)
|-- z: double (nullable = false)
and I need to create a column that will hold the result of a calculation like
sqrt(x)+cqrt(y)+z*constantK
I'm trying something like the following:
val constantK=100500
val df2= df1.select($"id", (scala.math.sqrt($"x")+scala.math.cqrt($"y")+$"z"*constantK ))
However, I get a type mismatch error:
<console>:59: error: type mismatch;
found : org.apache.spark.sql.ColumnName
required: Double
val df2= df1.select($"id", (scala.math.sqrt($"x")+scala.math.cqrt($"y")+$"z"*constantK ))
What is the proper way of adding columns with complex calculations based on the values of other columns in the DataFrame?

That happens because you are trying to use scala.math functions on Spark SQL Columns. Spark SQL has its own functions and types, which operate on Columns:
import org.apache.spark.sql.functions.sqrt
df1.select($"id", (sqrt($"x") + sqrt($"y") + $"z" * constantK))
The '*' operator is supported on Columns as well. Take a look at https://spark.apache.org/docs/2.3.0/api/sql/index.html
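If you also want the new column to have a name, and a cube root where the question writes cqrt, a minimal sketch using withColumn and the Column-based functions could look like this (assuming df1 and constantK as defined above; the column name "result" is just an example):
import org.apache.spark.sql.functions.{sqrt, cbrt}

val constantK = 100500
// sqrt and cbrt from spark.sql.functions operate on Columns, unlike scala.math
val df2 = df1.withColumn("result", sqrt($"x") + cbrt($"y") + $"z" * constantK)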

Related

How to convert a column from hex string to long?

I have a DataFrame with an Icao column containing hex codes that I would like to convert to the Long datatype. How could I do this in Spark SQL?
+------+-----+
| Icao|count|
+------+-----+
|471F8D|81350|
|471F58|79634|
|471F56|79112|
|471F86|78177|
|471F8B|75300|
|47340D|75293|
|471F83|74864|
|471F57|73815|
|471F4A|72290|
|471F5F|72133|
|40612C|69676|
+------+-----+
TL;DR Use the conv standard function.
conv(num: Column, fromBase: Int, toBase: Int): Column Convert a number in a string column from one base to another.
With conv a solution could be as follows:
scala> icao.show
+------+-----+
| Icao|count|
+------+-----+
|471F8D|81350|
|471F58|79634|
|471F56|79112|
|471F86|78177|
|471F8B|75300|
|47340D|75293|
|471F83|74864|
|471F57|73815|
|471F4A|72290|
|471F5F|72133|
|40612C|69676|
+------+-----+
// conv is not in scope by default, so import it explicitly
import org.apache.spark.sql.functions.conv
val s1 = icao.withColumn("conv", conv($"Icao", 16, 10))
scala> s1.show
+------+-----+-------+
| Icao|count| conv|
+------+-----+-------+
|471F8D|81350|4661133|
|471F58|79634|4661080|
|471F56|79112|4661078|
|471F86|78177|4661126|
|471F8B|75300|4661131|
|47340D|75293|4666381|
|471F83|74864|4661123|
|471F57|73815|4661079|
|471F4A|72290|4661066|
|471F5F|72133|4661087|
|40612C|69676|4219180|
+------+-----+-------+
conv returns a result of the same type as the input column, so since I started with strings I got strings.
scala> s1.printSchema
root
|-- Icao: string (nullable = true)
|-- count: string (nullable = true)
|-- conv: string (nullable = true)
If I had used ints I'd have got ints.
You can cast the result of conv using the built-in cast method (or start with a proper type for the input column).
val s2 = icao.withColumn("conv", conv($"Icao", 16, 10) cast "long")
scala> s2.printSchema
root
|-- Icao: string (nullable = true)
|-- count: string (nullable = true)
|-- conv: long (nullable = true)
scala> s2.show
+------+-----+-------+
| Icao|count| conv|
+------+-----+-------+
|471F8D|81350|4661133|
|471F58|79634|4661080|
|471F56|79112|4661078|
|471F86|78177|4661126|
|471F8B|75300|4661131|
|47340D|75293|4666381|
|471F83|74864|4661123|
|471F57|73815|4661079|
|471F4A|72290|4661066|
|471F5F|72133|4661087|
|40612C|69676|4219180|
+------+-----+-------+
You can use Java's hex-to-Long converter:
java.lang.Long.parseLong(hex.trim(), 16)
All you need is to define a udf function as below
import org.apache.spark.sql.functions.udf
def hexToLong = udf((hex: String) => java.lang.Long.parseLong(hex.trim(), 16))
and call the udf function using the .withColumn API:
df.withColumn("Icao", hexToLong($"Icao")).show(false)

Euclidean distance in Spark 2.1

I'm trying to calculate the Euclidean distance between two vectors. I have the following dataframe:
root
|-- h: string (nullable = true)
|-- id: string (nullable = true)
|-- sid: string (nullable = true)
|-- features: vector (nullable = true)
|-- episodeFeatures: vector (nullable = true)
import org.apache.spark.mllib.util.{MLUtils}
val jP2 = jP.withColumn("dist", MLUtils.fastSquaredDistance("features", 5, "episodeFeatures", 5))
I get an error like so:
error: method fastSquaredDistance in object MLUtils cannot be accessed in object org.apache.spark.mllib.util.MLUtils
Is there a way to access that private method?
MLUtils is an internal package, and even if it weren't, fastSquaredDistance couldn't be used on Columns or (guessing from the version) on ml vectors. You have to define your own udf:
import org.apache.spark.sql.functions._
import org.apache.spark.ml.linalg.Vector
val euclidean = udf((v1: Vector, v2: Vector) => ???) // Fill with preferred logic
val jP2 = jP.withColumn("dist", euclidean($"features", $"episodeFeatures"))

How to get the first value of each list?

My DataFrame looks like this:
+------------------------+----------------------------------------+
|ID |probability |
+------------------------+----------------------------------------+
|583190715ccb64f503a|[0.49128147201958017,0.5087185279804199]|
|58326da75fc764ad200|[0.42143416087939345,0.5785658391206066]|
|583270ff17c76455610|[0.3949217100212508,0.6050782899787492] |
|583287c97ec7641b2d4|[0.4965059792664432,0.5034940207335569] |
|5832d7e279c764f52e4|[0.49128147201958017,0.5087185279804199]|
|5832e5023ec76406760|[0.4775830044196701,0.52241699558033] |
|5832f88859cb64960ea|[0.4360509428173421,0.563949057182658] |
|58332e6238c7643e6a7|[0.48730029128352853,0.5126997087164714]|
+------------------------+----------------------------------------+
and I get the probability column using
val proVal = Data.select("probability").rdd.map(r => r(0)).collect()
proVal.foreach(println)
the result is:
[0.49128147201958017,0.5087185279804199]
[0.42143416087939345,0.5785658391206066]
[0.3949217100212508,0.6050782899787492]
[0.4965059792664432,0.5034940207335569]
[0.49128147201958017,0.5087185279804199]
[0.4775830044196701,0.52241699558033]
[0.4360509428173421,0.563949057182658]
[0.48730029128352853,0.5126997087164714]
but I want to get just the first value for each row, like this:
0.49128147201958017
0.42143416087939345
0.3949217100212508
0.4965059792664432
0.49128147201958017
0.4775830044196701
0.4360509428173421
0.48730029128352853
How can this be done?
The input is standard random forest output; Data above comes from val Data = predictions.select("docID", "probability"), where predictions has this schema:
predictions.printSchema()
root
|-- docID: string (nullable = true)
|-- label: double (nullable = false)
|-- features: vector (nullable = true)
|-- indexedLabel: double (nullable = true)
|-- rawPrediction: vector (nullable = true)
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = true)
|-- predictedLabel: string (nullable = true)
and I want to get the first value of the "probability" column
You can use the Column.apply method to get the n-th item of an array column - in this case the first element (using index 0):
import sqlContext.implicits._
val proVal = Data.select($"probability"(0)).rdd.map(r => r(0)).collect()
BTW, if you're using Spark 1.6 or higher, you can also use the Dataset API for a cleaner way to convert the dataframe into Doubles:
val proVal = Data.select($"probability"(0)).as[Double].collect()

How do I cast a field from double to float and round it using PySpark?

I have two dataframes with the schema as below:
books_with_10_ratings_or_more_with_title:
root
|-- ISBN: string (nullable = true)
|-- count: long (nullable = false)
|-- average: double (nullable = true)
and
books_df:
root
|-- ISBN: string (nullable = true)
|-- count: long (nullable = false)
|-- average: double (nullable = true)
I tried to join them together and change the rating (i.e. average) to float
books_with_10_ratings_or_more_with_title = books_with_10_ratings_or_more.join(books_df, 'ISBN').select('ISBN', 'Book-Title', 'Book-Author', 'Year', books_with_10_ratings_or_more.average.cast(float))
so that I can round it afterwards, but it throws an error:
unexpected type:
What's wrong with the code and how do I fix it? Thank you very much.
You can either do
books_with_10_ratings_or_more.average.cast('float')
or
from pyspark.sql.types import FloatType
books_with_10_ratings_or_more.average.cast(FloatType())
There is an example in the official API doc
EDIT
So you tried to cast because round complained about the type. You don't have to cast, though: rounding to three digits makes no practical difference between FloatType and DoubleType.
Your round won't work because you are using Python's built-in round function. You need to import round from pyspark.sql.functions instead. For example,
from pyspark.sql.types import Row
from pyspark.sql.functions import col, round
df = sc.parallelize([
    Row(isbn=1, count=1, average=10.6666666),
    Row(isbn=2, count=1, average=11.1111111)
]).toDF()
df.select(round(col('average'), 3).alias('average')).collect()

Replacing null values with 0 after Spark DataFrame left outer join

I have two dataframes called left and right.
scala> left.printSchema
root
|-- user_uid: double (nullable = true)
|-- labelVal: double (nullable = true)
|-- probability_score: double (nullable = true)
scala> right.printSchema
root
|-- user_uid: double (nullable = false)
|-- real_labelVal: double (nullable = false)
Then, I join them to get the joined DataFrame. It is a left outer join. Anyone interested in the natjoin function can find it here:
https://gist.github.com/anonymous/f02bd79528ac75f57ae8
scala> val joinedData = natjoin(predictionDataFrame, labeledObservedDataFrame, "left_outer")
scala> joinedData.printSchema
root
|-- user_uid: double (nullable = true)
|-- labelVal: double (nullable = true)
|-- probability_score: double (nullable = true)
|-- real_labelVal: double (nullable = false)
Since it is a left outer join, the real_labelVal column has nulls when user_uid is not present in right.
scala> val realLabelVal = joinedData.select("real_labelval").distinct.collect
realLabelVal: Array[org.apache.spark.sql.Row] = Array([0.0], [null])
I want to replace the null values in the real_labelVal column with 1.0.
Currently I do the following:
I find the index of the real_labelVal column and use the spark.sql.Row API to set the nulls to 1.0.
(This gives me an RDD[Row].)
Then I apply the schema of the joined dataframe to get the cleaned dataframe.
The code is as follows:
val real_labelval_index = 3
def replaceNull(row: Row) = {
  val rowArray = row.toSeq.toArray
  rowArray(real_labelval_index) = 1.0
  Row.fromSeq(rowArray)
}
val cleanRowRDD = joinedData.map(row => if (row.isNullAt(real_labelval_index)) replaceNull(row) else row)
val cleanJoined = sqlContext.createDataFrame(cleanRowRDD, joinedData.schema)
Is there an elegant or efficient way to do this?
Googling hasn't helped much.
Thanks in advance.
Have you tried using na?
joinedData.na.fill(1.0, Seq("real_labelval"))
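This fills the nulls with 1.0 only in that column and returns a new DataFrame, so there is no need to rebuild it from an RDD. A minimal sketch (assuming the default case-insensitive column resolution, so "real_labelval" matches real_labelVal):
val cleanJoined = joinedData.na.fill(1.0, Seq("real_labelVal"))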