Spark 2 converting scala array to WrappedArray - scala

Spark 2 is automatically converting a Scala array to a WrappedArray when I pass an array to a function. In Spark 1.6, however, the array was converted to a string like '[a,b,c]'. Here is my code:
val df_date_agg = df
.groupBy($"a",$"b",$"c")
.agg(sum($"d").alias("data1"),sum($"e").alias("data2"))
.groupBy($"a")
.agg(collect_list(array($"b",$"c",$"data1")).alias("final_data1"),
collect_list(array($"b",$"c",$"data2")).alias("final_data2"))
When I run the above code on Spark 1.6, I get the schema below:
|-- final_data1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- final_data2: array (nullable = true)
| |-- element: string (containsNull = true)
but in Spark 2:
|-- final_data1: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- final_data2: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
How can I make Spark 2 produce the same datatype as Spark 1.6?

Since you want a string representation of an array, how about casting the array into a string like this?
val df_date_agg = df
.groupBy($"a",$"b",$"c")
.agg(sum($"d").alias("data1"),sum($"e").alias("data2"))
.groupBy($"a")
.agg(collect_list(array($"b",$"c",$"data1") cast "string").alias("final_data1"),
collect_list(array($"b",$"c",$"data2") cast "string").alias("final_data2"))
It might simply be what your old version of Spark was doing; I was not able to verify.
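If you would rather build the flat string yourself instead of relying on cast's formatting, concat_ws is another option. This is a sketch of my own, not part of the original answer; it assumes the same df and column names as the question and joins the values with commas (producing 'a,b,c' without the surrounding brackets):
import org.apache.spark.sql.functions.{collect_list, concat_ws, sum}
val df_date_agg_ws = df
  .groupBy($"a", $"b", $"c")
  .agg(sum($"d").alias("data1"), sum($"e").alias("data2"))
  .groupBy($"a")
  .agg(collect_list(concat_ws(",", $"b", $"c", $"data1")).alias("final_data1"),
       collect_list(concat_ws(",", $"b", $"c", $"data2")).alias("final_data2"))
Either way, final_data1 and final_data2 come back as array<string>, matching the Spark 1.6 schema.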

Related

How to cast column with nested array in pyspark

I have this schema in pyspark:
root
|-- SortedLenders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- LenderID: string (nullable = true)
| | |-- MaxProfit: string (nullable = true)
|-- FilteredOutDecisions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ApprovedAmount: integer (nullable = true)
| | |-- Reasons: array (nullable = true)
| | | |-- element: integer (containsNull = true)
How do I cast the FilteredOutDecisions.Reasons column to double? Thank you in advance!
Try this:
import pyspark.sql.functions as f  # needed for f.expr

df = (
    df
    .withColumn('newFilteredOutDecisions', f.expr('transform(FilteredOutDecisions, element -> struct(element.ApprovedAmount as ApprovedAmount, transform(element.Reasons, value -> cast(value as double)) as Reasons))'))
)
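For completeness, here is a sketch of the same idea in Scala. It is not from the original answer; it assumes Spark 2.4+ (where the transform higher-order function is available in SQL expressions) and the same DataFrame df and column names as the question:
import org.apache.spark.sql.functions.expr
// transform rewrites each struct, casting every entry of Reasons to double
val converted = df.withColumn("newFilteredOutDecisions", expr(
  "transform(FilteredOutDecisions, element -> struct(" +
    "element.ApprovedAmount as ApprovedAmount, " +
    "transform(element.Reasons, value -> cast(value as double)) as Reasons))"))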

Spark Bitwise XOR between two DataFrame Columns

I have a Spark DataFrame testDF with the following structure
scala> testDF.printSchema()
root
|-- id: long (nullable = true)
|-- array1: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- array2: array (nullable = true)
| |-- element: integer (containsNull = true)
Both array1 and array2 are guaranteed to be of the same length.
I want to perform a bitwise XOR between each element of array1 and array2.
Something like:
res = array1[0] ^ array2[0] + array1[1] ^ array2[1] + ...
I know this can be done using a UDF, but I am wondering if there is a native Spark way to do it.
The desired DataFrame structure would look something like:
root
|-- id: long (nullable = true)
|-- array1: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- array2: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- result: long (nullable = true)
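One possible way to do this without a UDF, as a hedged sketch of mine (it assumes Spark 2.4+, where the zip_with and aggregate higher-order functions are available, and uses the column names from the question):
import org.apache.spark.sql.functions.expr
// zip_with XORs the arrays element-wise; aggregate then sums the XORed values into a long
val res = testDF.withColumn("result",
  expr("aggregate(zip_with(array1, array2, (x, y) -> x ^ y), 0L, (acc, v) -> acc + v)"))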

Scala Spark: How to extract nested column names from a parquet file and add a prefix to them

The idea is to read a parquet file into a DataFrame, then extract all column names and types from its schema. If a column is nested, I would like to add a "prefix" before the column name.
Keep in mind that a nested column can have properly named sub-columns, but it can also be an array of arrays with no column name other than "element".
val dfSource: DataFrame = spark.read.parquet("path.parquet")
val dfSourceSchema: StructType = dfSource.schema
Example of dfSourceSchema (Input):
|-- exCar: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: binary (nullable = true)
|-- exProduct: string (nullable = true)
|-- exName: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- exNameOne: string (nullable = true)
| | |-- exNameTwo: string (nullable = true)
Desired output:
((exCar.prefix.prefix, binary), (exProduct, String), (exName.prefix.exNameOne, String), (exName.prefix.exNameTwo, String))
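One way to approach this, as a rough sketch (mine, not from the thread): recursively walk the StructType and append the literal "prefix" segment whenever an array level is crossed. The flatten helper and the exact output format are assumptions for illustration.
import org.apache.spark.sql.types._

def flatten(dt: DataType, path: String): Seq[(String, String)] = dt match {
  case s: StructType =>
    s.fields.flatMap(f => flatten(f.dataType, if (path.isEmpty) f.name else s"$path.${f.name}"))
  case a: ArrayType => flatten(a.elementType, s"$path.prefix")
  case other        => Seq(path -> other.typeName)
}

// flatten(dfSourceSchema, "") should yield pairs like
// ("exCar.prefix.prefix", "binary") and ("exName.prefix.exNameOne", "string")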

Fetch values from nested struct in Spark DF

A Spark DF with one column, where each row is of type org.apache.spark.sql.Row and has the following form:
col1: array (nullable = true)
| |-- A1: struct (containsNull = true)
| | |-- B1: struct (nullable = true)
| | | |-- B11: string (nullable = true)
| | | |-- B12: string (nullable = true)
| | |-- B2: string (nullable = true)
I am trying to get the value of A1 -> B1 -> B11.
Is there any way to fetch this with the DataFrame APIs or indexing, without converting each row into a Seq and iterating through it, which hurts my performance badly? Any suggestions would be great. Thanks
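A hedged sketch of one approach with the DataFrame API: dotted field extraction works through arrays of structs, so selecting the nested path returns an array of B11 values per row with no per-row Seq conversion. The exact path depends on how the schema above is really shaped, which is an assumption here.
import org.apache.spark.sql.functions.col

// If A1 is the array element itself, the path is col1.B1.B11; if A1 is a field
// of the element, use col1.A1.B1.B11 instead (df is the DataFrame from the question)
val b11 = df.select(col("col1.B1.B11").alias("B11"))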

Convert array of vectors to DenseVector

I am running Spark 2.1 with Scala. I am trying to convert an array of vectors into a DenseVector.
Here is my dataframe:
scala> df_transformed.printSchema()
root
|-- id: long (nullable = true)
|-- vals: vector (nullable = true)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
scala> df_transformed.show()
+------------+--------------------+--------------------+
| id| vals| hashValues|
+------------+--------------------+--------------------+
|401310732094|[-0.37154,-0.1159...|[[-949518.0], [47...|
|292125586474|[-0.30407,0.35437...|[[-764013.0], [31...|
|362051108485|[-0.36748,0.05738...|[[-688834.0], [18...|
|222480119030|[-0.2509,0.55574,...|[[-1167047.0], [2...|
|182270925238|[0.32288,-0.60789...|[[-836660.0], [97...|
+------------+--------------------+--------------------+
For example, I need to extract the value of the hashValues column into a DenseVector for id 401310732094.
This can be done with a UDF:
import org.apache.spark.ml.linalg.{Vector, Vectors}  // assuming spark.ml vectors
import org.apache.spark.sql.functions.udf
import spark.implicits._

val convertToVec = udf((array: Seq[Vector]) =>
  Vectors.dense(array.flatMap(_.toArray).toArray)
)
val df = df_transformed.withColumn("hashValues", convertToVec($"hashValues"))
This will overwrite the hashValues column with a new one containing a DenseVector.
Tested with a dataframe with the following schema:
root
|-- id: integer (nullable = false)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
The result is:
root
|-- id: integer (nullable = false)
|-- hashValues: vector (nullable = true)
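To then pull the converted vector out for a single row, for example the id mentioned in the question, something like the following should work (a sketch of mine, assuming the df produced by the answer above and spark.ml vectors):
import org.apache.spark.ml.linalg.Vector

// Filter to the requested id, take the first row, and read back the DenseVector
val vec = df.filter($"id" === 401310732094L)
  .select($"hashValues")
  .head()
  .getAs[Vector]("hashValues")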