Convert array of vectors to DenseVector - scala

I am running Spark 2.1 with Scala. I am trying to convert an array of vectors into a DenseVector.
Here is my dataframe:
scala> df_transformed.printSchema()
root
|-- id: long (nullable = true)
|-- vals: vector (nullable = true)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
scala> df_transformed.show()
+------------+--------------------+--------------------+
| id| vals| hashValues|
+------------+--------------------+--------------------+
|401310732094|[-0.37154,-0.1159...|[[-949518.0], [47...|
|292125586474|[-0.30407,0.35437...|[[-764013.0], [31...|
|362051108485|[-0.36748,0.05738...|[[-688834.0], [18...|
|222480119030|[-0.2509,0.55574,...|[[-1167047.0], [2...|
|182270925238|[0.32288,-0.60789...|[[-836660.0], [97...|
+------------+--------------------+--------------------+
For example, I need to extract the value of the hashValues column into a DenseVector for id 401310732094.

This can be done with a UDF:
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf
import spark.implicits._

val convertToVec = udf((array: Seq[Vector]) =>
  Vectors.dense(array.flatMap(_.toArray).toArray)
)
val df = df_transformed.withColumn("hashValues", convertToVec($"hashValues"))
This will overwrite the hashValues column with a new one containing a DenseVector.
Tested with a dataframe with the following schema:
root
|-- id: integer (nullable = false)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
The result is:
root
|-- id: integer (nullable = false)
|-- hashValues: vector (nullable = true)
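If you then need the DenseVector for a single id (e.g. 401310732094 from the example above), one option is to filter and read the value back out of the row. This is a small sketch building on the converted df from the snippet above, not part of the original answer.
import org.apache.spark.ml.linalg.Vector

// Sketch: pull the converted vector back out for one id.
val vec: Vector = df
  .filter($"id" === 401310732094L)
  .select("hashValues")
  .head()
  .getAs[Vector]("hashValues")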

Related

Spark Bitwise XOR between two DataFrame Columns

I have a Spark DataFrame testDF with the following structure:
scala> testDF.printSchema()
root
|-- id: long (nullable = true)
|-- array1: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- array2: array (nullable = true)
| |-- element: integer (containsNull = true)
Both array1 and array2 are guaranteed to be of the same length.
I want to perform bitwise xor between each element from array1 and array2.
Something like:
res = array1[0] ^ array2[0] + array1[1] ^ array2[1] + ...
I know this can be done using a UDF, but I am wondering if there is a native Spark way to do it.
The desired DataFrame structure would look something like:
root
|-- id: long (nullable = true)
|-- array1: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- array2: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- result: long (nullable = true)
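No answer is attached to this related question here, but for reference, below is a minimal UDF-based sketch of the computation described above (column names taken from the question; this is an illustration, not an authoritative answer). On Spark 2.4+ the higher-order SQL functions zip_with and aggregate can express the same thing without a UDF.
import org.apache.spark.sql.functions.{col, udf}

// Element-wise bitwise XOR of the two arrays, summed to a Long as in the
// desired result column.
val xorSum = udf((a: Seq[Int], b: Seq[Int]) =>
  a.zip(b).map { case (x, y) => (x ^ y).toLong }.sum
)

val res = testDF.withColumn("result", xorSum(col("array1"), col("array2")))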

Scala Spark: How to extract nested column names from a parquet file and add a prefix to them

The idea is to read a parquet file into a DataFrame, then extract all column names and types from its schema. If we have nested columns, I would like to add a "prefix" before the column name.
Note that we can have a nested column with properly named sub-columns, and we can also have a nested column that is just an array of arrays, with no column name other than "element".
val dfSource: DataFrame = spark.read.parquet("path.parquet")
val dfSourceSchema: StructType = dfSource.schema
Example of dfSourceSchema (Input):
|-- exCar: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: binary (nullable = true)
|-- exProduct: string (nullable = true)
|-- exName: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- exNameOne: string (nullable = true)
| | |-- exNameTwo: string (nullable = true)
Desired output :
((exCar.prefix.prefix, binary), (exProduct, String), (exName.prefix.exNameOne, String), (exName.prefix.exNameTwo, String))
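No answer is attached here either; below is a rough recursive sketch of how one might walk the schema, assuming "prefix" stands for the anonymous array-element level shown in the desired output (the helper names flattenType and flattenSchema are made up for illustration).
import org.apache.spark.sql.types._

// Walk the schema, descending into structs and array elements; leaves are
// returned as (dotted path, type) pairs.
def flattenType(dt: DataType, path: String): Seq[(String, String)] = dt match {
  case st: StructType     => st.fields.toSeq.flatMap(f => flattenType(f.dataType, s"$path.${f.name}"))
  case ArrayType(elem, _) => flattenType(elem, s"$path.prefix")
  case other              => Seq(path -> other.simpleString)
}

def flattenSchema(schema: StructType): Seq[(String, String)] =
  schema.fields.toSeq.flatMap(f => flattenType(f.dataType, f.name))

// flattenSchema(dfSourceSchema) should yield pairs like
// (exCar.prefix.prefix, binary), (exProduct, string),
// (exName.prefix.exNameOne, string), (exName.prefix.exNameTwo, string)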

Spark 2 converting scala array to WrappedArray

Spark 2 automatically converts a Scala array to a WrappedArray when I pass an array to a function. However, in Spark 1.6 the array is converted to a string like '[a,b,c]'. Here is my code:
val df_date_agg = df
.groupBy($"a",$"b",$"c")
.agg(sum($"d").alias("data1"),sum($"e").alias("data2"))
.groupBy($"a")
.agg(collect_list(array($"b",$"c",$"data1")).alias("final_data1"),
collect_list(array($"b",$"c",$"data2")).alias("final_data2"))
When I run the above code on Spark 1.6, I get the schema below:
|-- final_data1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- final_data2: array (nullable = true)
| |-- element: string (containsNull = true)
but in Spark 2:
|-- final_data1: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- final_data2: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
How can I change the datatype in Spark 2 to match Spark 1.6?
Since you want a string representation of an array, how about casting the array into a string like this?
val df_date_agg = df
.groupBy($"a",$"b",$"c")
.agg(sum($"d").alias("data1"),sum($"e").alias("data2"))
.groupBy($"a")
.agg(collect_list(array($"b",$"c",$"data1") cast "string").alias("final_data1"),
collect_list(array($"b",$"c",$"data2") cast "string").alias("final_data2"))
That might simply be what your old version of Spark was doing; I was not able to verify.
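If the goal is to reproduce the literal Spark 1.6-style string (e.g. '[b,c,data1]' with no spaces), a hedged alternative is to build the string explicitly with concat and concat_ws instead of relying on the cast. This sketch reuses the column names from the question and is not verified against Spark 1.6's exact formatting.
import org.apache.spark.sql.functions._

// Build "[b,c,data1]"-style strings explicitly before collect_list.
val df_date_agg = df
  .groupBy($"a", $"b", $"c")
  .agg(sum($"d").alias("data1"), sum($"e").alias("data2"))
  .groupBy($"a")
  .agg(
    collect_list(concat(lit("["),
      concat_ws(",", $"b".cast("string"), $"c".cast("string"), $"data1".cast("string")),
      lit("]"))).alias("final_data1"),
    collect_list(concat(lit("["),
      concat_ws(",", $"b".cast("string"), $"c".cast("string"), $"data2".cast("string")),
      lit("]"))).alias("final_data2")
  )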

How to get the first row data of each list?

My DataFrame looks like this:
+------------------------+----------------------------------------+
|ID |probability |
+------------------------+----------------------------------------+
|583190715ccb64f503a|[0.49128147201958017,0.5087185279804199]|
|58326da75fc764ad200|[0.42143416087939345,0.5785658391206066]|
|583270ff17c76455610|[0.3949217100212508,0.6050782899787492] |
|583287c97ec7641b2d4|[0.4965059792664432,0.5034940207335569] |
|5832d7e279c764f52e4|[0.49128147201958017,0.5087185279804199]|
|5832e5023ec76406760|[0.4775830044196701,0.52241699558033] |
|5832f88859cb64960ea|[0.4360509428173421,0.563949057182658] |
|58332e6238c7643e6a7|[0.48730029128352853,0.5126997087164714]|
and I get the probability column using:
val proVal = Data.select("probability").rdd.map(r => r(0)).collect()
proVal.foreach(println)
the result is:
[0.49128147201958017,0.5087185279804199]
[0.42143416087939345,0.5785658391206066]
[0.3949217100212508,0.6050782899787492]
[0.4965059792664432,0.5034940207335569]
[0.49128147201958017,0.5087185279804199]
[0.4775830044196701,0.52241699558033]
[0.4360509428173421,0.563949057182658]
[0.48730029128352853,0.5126997087164714]
but I want to get only the first value of each row, like this:
0.49128147201958017
0.42143416087939345
0.3949217100212508
0.4965059792664432
0.49128147201958017
0.4775830044196701
0.4360509428173421
0.48730029128352853
How can this be done?
The input is standard random forest output; the Data used above is val Data = predictions.select("docID", "probability")
predictions.printSchema()
root
|-- docID: string (nullable = true)
|-- label: double (nullable = false)
|-- features: vector (nullable = true)
|-- indexedLabel: double (nullable = true)
|-- rawPrediction: vector (nullable = true)
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = true)
|-- predictedLabel: string (nullable = true)
and I want to get the first value of the "probability" column.
You can use the Column.apply method to get the n-th item of an array column - in this case the first item (using index 0):
import sqlContext.implicits._
val proVal = Data.select($"probability"(0)).rdd.map(r => r(0)).collect()
BTW, if you're using Spark 1.6 or higher, you can also use the Dataset API for a cleaner way to convert the dataframe into Doubles:
val proVal = Data.select($"probability"(0)).as[Double].collect()
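Note that the schema above shows probability as a vector rather than an array column; if item extraction does not apply directly to the vector type in your version, a small UDF is a safe fallback. This is a hedged sketch assuming Spark 2.x ML vectors, not part of the original answer.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

// Fallback: extract the first probability from the ML vector via a UDF.
val firstProb = udf((v: Vector) => v(0))
val proVal = Data.select(firstProb($"probability")).as[Double].collect()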

How to join two dataframes where the key used for joining has a different datatype in each dataframe

I have two dataframes, df1 and df2, whose schemas are as follows:
df1 is of the form:
slotSize2: struct (nullable = true)
| |-- 120x600: struct (nullable = true)
| | |-- pView: string (nullable = true)
| |-- 160x600: struct (nullable = true)
| | |-- level: string (nullable = true)
| | |-- pATF: string (nullable = true)
| | |-- pView: string (nullable = true)
| | |-- pViewV1: string (nullable = true)
| | |-- sPos: string (nullable = true)
| |-- 250x250: struct (nullable = true)
| | |-- pView: string (nullable = true)
| |-- 300x250: struct (nullable = true)
| | |-- level: string (nullable = true)
| | |-- pATF: string (nullable = true)
| | |-- pView: string (nullable = true)
| | |-- pViewV1: string (nullable = true)
| | |-- sPos: string (nullable = true)
Dataframe df2 has the schema:
root
|-- bidId: array (nullable = true)
| |-- element: string (containsNull = true)
|-- slotSize1: array (nullable = true)
| |-- element: string (containsNull = true)
In dataframe df2 we have slotSize (named slotSize1) in the form of a string, while in dataframe df1 we have a nested form of slotSize, i.e. for every slot size there is a corresponding map.
I want to join the two dataframes df1 and df2 to form a new dataframe df3 with schema (bidId, slotSize, viewMap), where bidId is present in df2, slotSize is of the form 120x600 and is present in both schemas, and viewMap is the nested map corresponding to each slotSize in df1.
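No answer is attached to this question here; below is a rough, unverified sketch of one approach, assuming df1 holds a single row of slot metadata, that bidId and slotSize1 in df2 line up by array position, and that Spark 2.4+ is available for arrays_zip. The per-slot structs are serialized to JSON so rows with different struct schemas can be unioned; the helper names are made up for illustration.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

// 1. One row per slot size from df1: (slotSize, viewMap-as-JSON).
val slotFields = df1.schema("slotSize2").dataType.asInstanceOf[StructType].fieldNames
val slotRows = slotFields.map { name =>
  df1.select(lit(name).as("slotSize"), to_json(col(s"slotSize2.`$name`")).as("viewMap"))
}.reduce(_ union _)

// 2. One row per (bidId, slotSize) pair from df2.
val exploded = df2
  .select(explode(arrays_zip($"bidId", $"slotSize1")).as("z"))
  .select($"z.bidId".as("bidId"), $"z.slotSize1".as("slotSize"))

// 3. Join on the slot-size string to get (bidId, slotSize, viewMap).
val df3 = exploded.join(slotRows, Seq("slotSize")).select("bidId", "slotSize", "viewMap")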