Convert WrappedArray into an iterable in Scala

There is a Spark DF with schema -
root
|-- array_name: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- string_element: string (nullable = true)
| | |-- double_element: double (nullable = true)
For a row of the df,
val wrappedArray = row.get(0)
results in an array of values that looks like the following, where each element is another array of 2 items (of type String and Double, respectively); however, the compile-time type is Any instead of WrappedArray:
[WrappedArray([2020-08-01,3.1109513062027883E7], [2020-09-01,2.975389347485656E7], [2020-10-01,3.0935205531489465E7], ...]
The end goal is to convert this into a Map where the String is the key and the Double (converted to Int) is the value.
The final variable should look like this:
val finalMap: Map[String, Int] = Map("2020-08-01" -> 31109513, "2020-09-01" -> 29753893, "2020-10-01" -> 30935205, ...)
The main issue is that Spark returns an Any, so wrappedArray's compile-time type is not iterable. I just want to be able to loop through wrappedArray, which Any does not allow.
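One way to make this loopable (a sketch, assuming the value really is a WrappedArray of struct Rows matching the schema above) is to ask for a typed Seq[Row] with getAs instead of the untyped get, and then build the map from each struct's fields:
import org.apache.spark.sql.Row
// getAs gives a typed Seq[Row] (a WrappedArray is a Seq) instead of Any
val structs: Seq[Row] = row.getAs[Seq[Row]](0)
val finalMap: Map[String, Int] = structs.map { s =>
  val key   = s.getAs[String]("string_element")
  val value = s.getAs[Double]("double_element").toInt  // truncates, e.g. 3.1109513...E7 -> 31109513
  key -> value
}.toMap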

Related

Convert array<map<string,string>> type to <string,string> in scala

I am facing a problem converting a column in my dataframe to string format. The schema of the dataframe is as follows:
|-- example_code_b: string (nullable = true)
|-- example_code: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
I want to convert example_code from the current array(map(string,string)) to (string,string) format.
The input is in the form of [Map(entity -> PER), Map(entity -> PER)] and
I want the output to be in the form of PER,PER
You can either use a UDF in the DataFrame API or use the Dataset API to do it:
import spark.implicits._
df
  .select("example_code")
  .as[Seq[Map[String, String]]]
  .map(s => s.reduce(_ ++ _))
  .toDF("example_code")
  .show()
Note that this does not handle duplicate keys: they are not "merged" but simply overwritten.
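The UDF alternative mentioned above could look roughly like this (a sketch, assuming the goal is the PER,PER output from the question; the joinValues name is just illustrative):
import org.apache.spark.sql.functions.{col, udf}
// flatten the array of maps and join all values with a comma
val joinValues = udf((maps: Seq[Map[String, String]]) => maps.flatMap(_.values).mkString(","))
df.withColumn("example_code", joinValues(col("example_code"))).show()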
You can simply use the explode function on any array column, which will create a separate row for each value of the array.
import org.apache.spark.sql.functions.{col, explode}
val newDF = df.withColumn("mymap", explode(col("example_code")))

Filter a DataFrame based on an element from an array column

I'm working with a dataframe with the following schema:
root
|-- c: long (nullable = true)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
I'm trying to filter this dataframe based on an element ["value1", "key1"] in the array column data, i.e. if this element exists in data, keep the row, else drop it. I tried
df.filter(col("data").contains("[value1, key1]"))
but it didn't work. I also tried
val f = Array("value1", "key1")
df.filter(col("data").contains(f))
but that didn't work either.
Any help please?
A straightforward approach would be to use a udf function, since a udf lets you apply logic row by row on primitive datatypes (which is what your requirement suggests: checking every key and value of the struct elements in the array column data).
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
//udf to check for key1 in key and value1 in value of every struct in the array field
def containsUdf = udf((data: Seq[Row]) => data.exists(row => row.getAs[String]("key") == "key1" && row.getAs[String]("value") == "value1"))
//calling the udf function in the filter
val filteredDF = df.filter(containsUdf(col("data")))
The resulting filteredDF should be your desired output.
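On Spark 2.4 or later, a UDF-free alternative (a sketch, assuming the same key1/value1 literals) is the exists higher-order function via a SQL expression:
import org.apache.spark.sql.functions.expr
// keep rows where some struct in `data` has key == "key1" and value == "value1"
val filteredDF2 = df.filter(expr("exists(data, x -> x.key = 'key1' AND x.value = 'value1')"))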

How to convert a column from hex string to long?

I have a DataFrame with Icao column with hex codes that I would like to convert to Long datatype. How could I do this in Spark SQL?
+------+-----+
| Icao|count|
+------+-----+
|471F8D|81350|
|471F58|79634|
|471F56|79112|
|471F86|78177|
|471F8B|75300|
|47340D|75293|
|471F83|74864|
|471F57|73815|
|471F4A|72290|
|471F5F|72133|
|40612C|69676|
+------+-----+
TL;DR Use the conv standard function.
conv(num: Column, fromBase: Int, toBase: Int): Column Convert a number in a string column from one base to another.
With conv a solution could be as follows:
scala> icao.show
+------+-----+
| Icao|count|
+------+-----+
|471F8D|81350|
|471F58|79634|
|471F56|79112|
|471F86|78177|
|471F8B|75300|
|47340D|75293|
|471F83|74864|
|471F57|73815|
|471F4A|72290|
|471F5F|72133|
|40612C|69676|
+------+-----+
// conv is not available by default unless you're in spark-shell
import org.apache.spark.sql.functions.conv
val s1 = icao.withColumn("conv", conv($"Icao", 16, 10))
scala> s1.show
+------+-----+-------+
| Icao|count| conv|
+------+-----+-------+
|471F8D|81350|4661133|
|471F58|79634|4661080|
|471F56|79112|4661078|
|471F86|78177|4661126|
|471F8B|75300|4661131|
|47340D|75293|4666381|
|471F83|74864|4661123|
|471F57|73815|4661079|
|471F4A|72290|4661066|
|471F5F|72133|4661087|
|40612C|69676|4219180|
+------+-----+-------+
conv has a feature of giving you a result of the type of the input column, so I started with strings and got strings.
scala> s1.printSchema
root
|-- Icao: string (nullable = true)
|-- count: string (nullable = true)
|-- conv: string (nullable = true)
If I had used ints I'd have got ints.
You could cast the result of conv using another built-in method cast (or start with a proper type of the input column).
val s2 = icao.withColumn("conv", conv($"Icao", 16, 10) cast "long")
scala> s2.printSchema
root
|-- Icao: string (nullable = true)
|-- count: string (nullable = true)
|-- conv: long (nullable = true)
scala> s2.show
+------+-----+-------+
| Icao|count| conv|
+------+-----+-------+
|471F8D|81350|4661133|
|471F58|79634|4661080|
|471F56|79112|4661078|
|471F86|78177|4661126|
|471F8B|75300|4661131|
|47340D|75293|4666381|
|471F83|74864|4661123|
|471F57|73815|4661079|
|471F4A|72290|4661066|
|471F5F|72133|4661087|
|40612C|69676|4219180|
+------+-----+-------+
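Since the question mentions Spark SQL, the same conversion can also be written as a plain SQL expression (a sketch, assuming the DataFrame is registered as a temp view named icao):
icao.createOrReplaceTempView("icao")
spark.sql("SELECT Icao, `count`, CAST(conv(Icao, 16, 10) AS BIGINT) AS conv FROM icao").show()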
You can use Java's built-in hex-to-Long converter:
java.lang.Long.parseLong(hex.trim(), 16)
All you need to do is define a udf function as below:
import org.apache.spark.sql.functions.udf
def hexToLong = udf((hex: String) => java.lang.Long.parseLong(hex.trim(), 16))
And call the udf function using the .withColumn API:
df.withColumn("Icao", hexToLong($"Icao")).show(false)

How to get the first row data of each list?

My DataFrame looks like this:
+------------------------+----------------------------------------+
|ID |probability |
+------------------------+----------------------------------------+
|583190715ccb64f503a|[0.49128147201958017,0.5087185279804199]|
|58326da75fc764ad200|[0.42143416087939345,0.5785658391206066]|
|583270ff17c76455610|[0.3949217100212508,0.6050782899787492] |
|583287c97ec7641b2d4|[0.4965059792664432,0.5034940207335569] |
|5832d7e279c764f52e4|[0.49128147201958017,0.5087185279804199]|
|5832e5023ec76406760|[0.4775830044196701,0.52241699558033] |
|5832f88859cb64960ea|[0.4360509428173421,0.563949057182658] |
|58332e6238c7643e6a7|[0.48730029128352853,0.5126997087164714]|
and I get the probability column using
val proVal = Data.select("probability").rdd.map(r => r(0)).collect()
proVal.foreach(println)
the result is :
[0.49128147201958017,0.5087185279804199]
[0.42143416087939345,0.5785658391206066]
[0.3949217100212508,0.6050782899787492]
[0.4965059792664432,0.5034940207335569]
[0.49128147201958017,0.5087185279804199]
[0.4775830044196701,0.52241699558033]
[0.4360509428173421,0.563949057182658]
[0.48730029128352853,0.5126997087164714]
but I want to get only the first value of each row, like this:
0.49128147201958017
0.42143416087939345
0.3949217100212508
0.4965059792664432
0.49128147201958017
0.4775830044196701
0.4360509428173421
0.48730029128352853
how can this be done?
The input is standard random forest output; the Data above is val Data = predictions.select("docID", "probability")
predictions.printSchema()
root
|-- docID: string (nullable = true)
|-- label: double (nullable = false)
|-- features: vector (nullable = true)
|-- indexedLabel: double (nullable = true)
|-- rawPrediction: vector (nullable = true)
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = true)
|-- predictedLabel: string (nullable = true)
and I want to get the first value of the "probability" column
You can use the Column.apply method to get the n-th item of an array column - in this case the first element (using index 0):
import sqlContext.implicits._
val proVal = Data.select($"probability"(0)).rdd.map(r => r(0)).collect()
BTW, if you're using Spark 1.6 or higher, you can also use the Dataset API for a cleaner way to convert the dataframe into Doubles:
val proVal = Data.select($"probability"(0)).as[Double].collect()
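Note that probability is an ML vector column (see the printSchema above), not an array column, so if the element-access syntax does not work in your Spark version, a small UDF that indexes into the vector is a common workaround (a sketch, assuming the ml, not mllib, Vector type):
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}
// extract the first entry of the probability vector
val firstProb = udf((v: Vector) => v(0))
val proVal = Data.select(firstProb(col("probability"))).rdd.map(_.getDouble(0)).collect()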

Accessing a Nested Map column in Spark Dataframes without using explode

I have a column in a Spark dataframe where the schema looks something like this:
|-- seg: map (nullable = false)
| |-- key: string
| |-- value: array (valueContainsNull = false)
| | |-- element: struct (containsNull = false)
| | | |-- id: integer (nullable = false)
| | | |-- expiry: long (nullable = false)
The value in the column looks something like this:
Map(10000124 -> WrappedArray([20185255,1561507200], [20185256,1561507200]))
What I want to do is create a column from this Map column which contains only an array of [20185255, 20185256] (the elements of the array are the 1st element of each struct in the WrappedArray). How do I do this?
I am trying not to use "explode".
Also, is there a way I can use a UDF which takes in the Map and gets those values?
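A UDF along these lines is one way to do it without explode (a sketch; extractIds is an illustrative name, and it assumes the seg schema shown above, where each struct's first field is id):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
// take the map<string, array<struct<id, expiry>>> and keep only the id of every struct
val extractIds = udf((seg: Map[String, Seq[Row]]) =>
  seg.values.flatten.map(_.getAs[Int]("id")).toSeq)
val withIds = df.withColumn("ids", extractIds(col("seg")))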