I have a DataFrame with an Icao column containing hex codes that I would like to convert to the Long datatype. How can I do this in Spark SQL?
+------+-----+
|  Icao|count|
+------+-----+
|471F8D|81350|
|471F58|79634|
|471F56|79112|
|471F86|78177|
|471F8B|75300|
|47340D|75293|
|471F83|74864|
|471F57|73815|
|471F4A|72290|
|471F5F|72133|
|40612C|69676|
+------+-----+
TL;DR Use the conv standard function.
conv(num: Column, fromBase: Int, toBase: Int): Column
Convert a number in a string column from one base to another.
With conv a solution could be as follows:
scala> icao.show
+------+-----+
| Icao|count|
+------+-----+
|471F8D|81350|
|471F58|79634|
|471F56|79112|
|471F86|78177|
|471F8B|75300|
|47340D|75293|
|471F83|74864|
|471F57|73815|
|471F4A|72290|
|471F5F|72133|
|40612C|69676|
+------+-----+
// conv is not available by default unless you're in spark-shell
import org.apache.spark.sql.functions.conv
val s1 = icao.withColumn("conv", conv($"Icao", 16, 10))
scala> s1.show
+------+-----+-------+
| Icao|count| conv|
+------+-----+-------+
|471F8D|81350|4661133|
|471F58|79634|4661080|
|471F56|79112|4661078|
|471F86|78177|4661126|
|471F8B|75300|4661131|
|47340D|75293|4666381|
|471F83|74864|4661123|
|471F57|73815|4661079|
|471F4A|72290|4661066|
|471F5F|72133|4661087|
|40612C|69676|4219180|
+------+-----+-------+
conv always returns its result as a string, so even though the values are now in base 10, the column type is still string.
scala> s1.printSchema
root
|-- Icao: string (nullable = true)
|-- count: string (nullable = true)
|-- conv: string (nullable = true)
To get a proper numeric column, cast the result of conv using another built-in method, cast:
val s2 = icao.withColumn("conv", conv($"Icao", 16, 10) cast "long")
scala> s2.printSchema
root
|-- Icao: string (nullable = true)
|-- count: string (nullable = true)
|-- conv: long (nullable = true)
scala> s2.show
+------+-----+-------+
| Icao|count| conv|
+------+-----+-------+
|471F8D|81350|4661133|
|471F58|79634|4661080|
|471F56|79112|4661078|
|471F86|78177|4661126|
|471F8B|75300|4661131|
|47340D|75293|4666381|
|471F83|74864|4661123|
|471F57|73815|4661079|
|471F4A|72290|4661066|
|471F5F|72133|4661087|
|40612C|69676|4219180|
+------+-----+-------+
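Since the question asks about Spark SQL specifically, the same conversion can also be expressed through the SQL interface. A minimal sketch, assuming the DataFrame is registered as a temporary view (the view name icao and the alias IcaoLong are just for illustration):
// register the DataFrame so it can be queried with SQL
icao.createOrReplaceTempView("icao")
// conv + CAST in plain SQL; BIGINT corresponds to Scala's Long
val s3 = spark.sql(
  "SELECT Icao, `count`, CAST(conv(Icao, 16, 10) AS BIGINT) AS IcaoLong FROM icao")
s3.printSchema should then report IcaoLong as long, just like s2 above.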
You can use Java's built-in hex-to-Long converter:
java.lang.Long.parseLong(hex.trim(), 16)
All you need to do is define a udf function as below
import org.apache.spark.sql.functions.udf
def hexToLong = udf((hex: String) => java.lang.Long.parseLong(hex.trim(), 16))
and then call the udf function using the .withColumn API:
df.withColumn("Icao", hexToLong($"Icao")).show(false)
Related
There is a Spark DataFrame with the following schema:
root
|-- array_name: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- string_element: string (nullable = true)
| | |-- double_element: double (nullable = true)
For a row of the df,
val wrappedArray = row.get(0)
results in an array of values that looks like the following, where each element is itself an array of 2 items (of type String and Double, respectively); however, the type is Any instead of WrappedArray:
[WrappedArray([2020-08-01,3.1109513062027883E7], [2020-09-01,2.975389347485656E7], [2020-10-01,3.0935205531489465E7], ...]
The end goal is to convert this into a Map where the String is the key and the Double (converted to Int) is the value
Final variable should look like this -
val finalMap: Map[String, Int] = Map("2020-08-01" -> 31109513, "2020-09-01" -> 29753893, "2020-10-01" -> 30935205, ...)
The main issue is that Spark returns an Any, so wrappedArray's compile-time type is not iterable. I just want to be able to loop through wrappedArray, which Any does not allow.
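One way around the Any problem, assuming row is an org.apache.spark.sql.Row obtained by collecting the DataFrame, is to use the typed accessor Row.getSeq instead of the untyped get. A minimal sketch under that assumption:
import org.apache.spark.sql.Row
// getSeq returns a typed Seq instead of Any, so it can be iterated directly;
// each element is itself a Row holding (string_element, double_element)
val finalMap: Map[String, Int] = row
  .getSeq[Row](0)
  .map(e => e.getString(0) -> e.getDouble(1).toInt)
  .toMap
With the sample values above, 3.1109513062027883E7.toInt gives 31109513, which matches the desired finalMap.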
I have the following dataset df1 in Spark:
root
|-- id: integer (nullable = true)
|-- t: string (nullable = true)
|-- x: double (nullable = false)
|-- y: double (nullable = false)
|-- z: double (nullable = false)
and I need to create a column that will hold the result of a calculation like
sqrt(x)+cqrt(y)+z*constantK
I'm trying something like the following:
val constantK=100500
val df2= df1.select($"id", (scala.math.sqrt($"x")+scala.math.cqrt($"y")+$"z"*constantK ))
However, I get a type mismatch error:
<console>:59: error: type mismatch;
found : org.apache.spark.sql.ColumnName
required: Double
val df2= df1.select($"id", (scala.math.sqrt($"x")+scala.math.cqrt($"y")+$"z"*constantK ))
What is the proper way of adding columns with complex calculations based on the values of other columns in the dataframe?
This is because you are trying to use scala.math functions on Spark SQL columns. Spark SQL has its own functions and types that operate on columns:
import org.apache.spark.sql.functions.sqrt
df1.select($"id", (sqrt($"x")+sqrt($"y")+$"z"*constantK ))
The operator '*' is supported on columns directly. Take a look at https://spark.apache.org/docs/2.3.0/api/sql/index.html
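If cqrt in the question was meant to be a cube root, Spark also ships a cbrt column function, so the whole expression can be written with built-ins only. A sketch under that assumption:
import org.apache.spark.sql.functions.{sqrt, cbrt}
val constantK = 100500
// every term here is a Column expression evaluated by Spark, not by scala.math
val df2 = df1.withColumn("result", sqrt($"x") + cbrt($"y") + $"z" * constantK)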
I have a Scala Spark dataframe with four columns (all string type) - P, Q, R, S - and a primary key (called PK) (integer type).
Each of these 4 columns may have null values. The left-to-right ordering of the columns represents the importance/relevance of each column and needs to be preserved. The structure of the base dataframe stays the same as shown.
I want the final output to be as follows:
root
|-- PK: integer (nullable = true)
|-- P: string (nullable = true)
|-- Q: string (nullable = true)
|-- R: string (nullable = true)
|-- S: string (nullable = true)
|-- categoryList: array (nullable = true)
| |-- myStruct: struct (nullable = true)
| | |-- category: boolean (nullable = true)
| | |-- relevance: boolean (nullable = true)
I need to create a new column derived from the 4 columns P, Q, R, S based on the following algorithm:
For each of the four column values in a row, check whether the value exists in the Map "mapM".
If the value exists, the "category" in the struct will be the corresponding value from mapM. If it does not exist in mapM, the category shall be null.
The "relevance" in the struct shall be the order of the column from left to right: P -> 1, Q -> 2, R -> 3, S -> 4.
The array formed by these four structs is then added to a new column on the dataframe provided.
I'm new to Scala and here is what I have until now:
case class relevanceCaseClass(category: String, relevance: Integer)
def myUdf = udf((code: String, relevance: Integer) => relevanceCaseClass(mapM.value.getOrElse(code, null), relevance))
df.withColumn("newColumn", myUdf(col("P/Q/R/S"), 1))
The problem with this is that I cannot pass the value of the ordering inside the withColumn function. I need to let the myUdf function know the value of the relevance. Am I doing something fundamentally wrong?
Thus I should get the output:
PK P Q R S newCol
1 a b c null array(struct("a", 1), struct(null, 2), struct("c", 3), struct(null, 4))
Here, the value "b" was not found in the map and hence the value (for category) is null. Since the value for column S was already null, it stayed null. The relevance is according to the left-right column ordering.
Given an input dataframe (for testing, as given in the OP) as
+---+---+---+---+----+
|PK |P |Q |R |S |
+---+---+---+---+----+
|1 |a |b |c |null|
+---+---+---+---+----+
root
|-- PK: integer (nullable = false)
|-- P: string (nullable = true)
|-- Q: string (nullable = true)
|-- R: string (nullable = true)
|-- S: null (nullable = true)
and a broadcasted Map as
val mapM = spark.sparkContext.broadcast(Map("a" -> "a", "c" -> "c"))
you can define a udf function and call it as below
def myUdf = udf((pqrs: Seq[String]) => pqrs.zipWithIndex.map(code => relevanceCaseClass(mapM.value.getOrElse(code._1, "null"), code._2+1)))
val finaldf = df.withColumn("newColumn", myUdf(array(col("P"), col("Q"), col("R"), col("S"))))
with the case class as in the OP
case class relevanceCaseClass(category: String, relevance: Integer)
which should give you your desired output, i.e. finaldf would be
+---+---+---+---+----+--------------------------------------+
|PK |P |Q |R |S |newColumn |
+---+---+---+---+----+--------------------------------------+
|1 |a |b |c |null|[[a, 1], [null, 2], [c, 3], [null, 4]]|
+---+---+---+---+----+--------------------------------------+
root
|-- PK: integer (nullable = false)
|-- P: string (nullable = true)
|-- Q: string (nullable = true)
|-- R: string (nullable = true)
|-- S: null (nullable = true)
|-- newColumn: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- category: string (nullable = true)
| | |-- relevance: integer (nullable = true)
I hope the answer is helpful
You can pass multiple columns to the udf, as in the following example:
case class Relevance(category: String, relevance: Integer)
import org.apache.spark.sql.functions.udf

// columns passed in P, Q, R, S order so the relevance values 1-4 line up with the required ordering
def myUdf = udf((p: String, q: String, r: String, s: String) => Seq(
  Relevance(mapM.value.getOrElse(p, null), 1),
  Relevance(mapM.value.getOrElse(q, null), 2),
  Relevance(mapM.value.getOrElse(r, null), 3),
  Relevance(mapM.value.getOrElse(s, null), 4)
))

df.withColumn("newColumn", myUdf(df("P"), df("Q"), df("R"), df("S")))
My DataFrame looks like this:
+------------------------+----------------------------------------+
|ID |probability |
+------------------------+----------------------------------------+
|583190715ccb64f503a|[0.49128147201958017,0.5087185279804199]|
|58326da75fc764ad200|[0.42143416087939345,0.5785658391206066]|
|583270ff17c76455610|[0.3949217100212508,0.6050782899787492] |
|583287c97ec7641b2d4|[0.4965059792664432,0.5034940207335569] |
|5832d7e279c764f52e4|[0.49128147201958017,0.5087185279804199]|
|5832e5023ec76406760|[0.4775830044196701,0.52241699558033] |
|5832f88859cb64960ea|[0.4360509428173421,0.563949057182658] |
|58332e6238c7643e6a7|[0.48730029128352853,0.5126997087164714]|
+------------------------+----------------------------------------+
and I get the column of probability using
val proVal = Data.select("probability").rdd.map(r => r(0)).collect()
proVal.foreach(println)
The result is:
[0.49128147201958017,0.5087185279804199]
[0.42143416087939345,0.5785658391206066]
[0.3949217100212508,0.6050782899787492]
[0.4965059792664432,0.5034940207335569]
[0.49128147201958017,0.5087185279804199]
[0.4775830044196701,0.52241699558033]
[0.4360509428173421,0.563949057182658]
[0.48730029128352853,0.5126997087164714]
but I want to get only the first value for each row, like this:
0.49128147201958017
0.42143416087939345
0.3949217100212508
0.4965059792664432
0.49128147201958017
0.4775830044196701
0.4360509428173421
0.48730029128352853
how can this be done?
The input is standard random forest output; the DataFrame above comes from val Data = predictions.select("docID", "probability"), where
predictions.printSchema()
root
|-- docID: string (nullable = true)
|-- label: double (nullable = false)
|-- features: vector (nullable = true)
|-- indexedLabel: double (nullable = true)
|-- rawPrediction: vector (nullable = true)
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = true)
|-- predictedLabel: string (nullable = true)
and I want to get the first value of the "probability" column
You can use the Column.apply method to get the n-th item of an array column, in this case the first item (using index 0):
import sqlContext.implicits._
val proVal = Data.select($"probability"(0)).rdd.map(r => r(0)).collect()
BTW, if you're using Spark 1.6 or higher, you can also use the Dataset API for a cleaner way to convert the dataframe into Doubles:
val proVal = Data.select($"probability"(0)).as[Double].collect()
I have the following DataFrame:
df.show()
+---------------+----+
| x| num|
+---------------+----+
|[0.1, 0.2, 0.3]| 0|
|[0.3, 0.1, 0.1]| 1|
|[0.2, 0.1, 0.2]| 2|
+---------------+----+
This DataFrame has the following column datatypes:
df.printSchema
root
|-- x: array (nullable = true)
| |-- element: double (containsNull = true)
|-- num: long (nullable = true)
I am currently trying to convert the array of doubles inside the DataFrame to an array of floats. I do it with the following udf:
val toFloat = udf[(val line: Seq[Double]) => line.map(_.toFloat)]
val test = df.withColumn("testX", toFloat(df("x")))
This code is currently not working. Can anybody share a solution for how to change the array type inside the DataFrame?
What I want is:
df.printSchema
root
|-- x: array (nullable = true)
| |-- element: float (containsNull = true)
|-- num: long (nullable = true)
This question is based on the question How to change the simple DataType in Spark SQL's DataFrame
Your udf is declared incorrectly. You should write it as follows:
val toFloat = udf((line: Seq[Double]) => line.map(_.toFloat))
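As a side note, for a plain double-to-float change you may be able to skip the udf entirely and cast the array column; a sketch, assuming Spark's cast between array element types covers this case:
// casts array<double> to array<float> element-wise, without a udf
val test = df.withColumn("testX", df("x").cast("array<float>"))
test.printSchema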