Schema changes when writing Dataframe containing Vector - scala

I am writing a Spark dataframe, in which one of the columns is of Vector datatype, to ORC. When I load the dataframe back, the schema changes.
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SaveMode}

var df: DataFrame = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")
df.printSchema
df.write.mode(SaveMode.Overwrite).orc("/some/path")
val newDF = spark.read.orc("/some/path")
newDF.printSchema
The output of df.printSchema is
|-- label: double (nullable = false)
|-- features: vector (nullable = true)
The output of newDF.printSchema is
|-- label: double (nullable = true)
|-- features: struct (nullable = true)
| |-- type: byte (nullable = true)
| |-- size: integer (nullable = true)
| |-- indices: array (nullable = true)
| | |-- element: integer (containsNull = true)
| |-- values: array (nullable = true)
| | |-- element: double (containsNull = true)
What is the issue here? I am using Spark 2.2.0 with Scala 2.11.8.
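The likely cause is that the ML Vector type is a user-defined type (VectorUDT) whose underlying representation is exactly this struct (type, size, indices, values); the ORC writer in Spark 2.2 does not preserve the UDT metadata, so the column is read back as the raw struct. A minimal sketch of a workaround, assuming that layout (type 1 = dense, 0 = sparse), is to rebuild the vector with a UDF after reading:

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// Rebuild an ml Vector from the struct that comes back from ORC.
// Assumed field order: type (byte), size (int), indices (array<int>), values (array<double>).
val toVector = udf { row: Row =>
  if (row.getByte(0) == 1) {
    Vectors.dense(row.getSeq[Double](3).toArray)                // dense vector
  } else {
    Vectors.sparse(row.getInt(1), row.getSeq[Int](2).toArray,   // sparse vector
      row.getSeq[Double](3).toArray)
  }
}

val restoredDF = newDF.withColumn("features", toVector(col("features")))
restoredDF.printSchema  // features is reported as vector again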

Related

Add a field that already exists in the pyspark df to a struct field

I have the following df:
sku    category  price  state  infos_gerais
33344  mmmma     3.00   SP     [{5, 5656655, 5845454}]
33344  mmmma     3.00   MG     [{5, 6565767, 5854545}]
33344  mmmma     3.00   RS     [{5, 8788787, 4564646}]
The schema of the df follows:
|-- sku: string (nullable = true)
|-- category: string (nullable = true)
|-- price: double (nullable = true)
|-- state: string (nullable = true)
|-- infos_gerais: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- service_type_id: integer (nullable = true)
| | |-- cep_ini: integer (nullable = true)
| | |-- cep_fim: integer (nullable = true)
Note that the field that doesn't repeat in the df is 'state', so I need to insert this field into the struct 'infos_gerais' and apply a groupBy. I tried the code below, but it returns an error. Can anyone help me?
df_end = df_end.withColumn(
    "infos_gerais",
    sf.collect_list(
        sf.struct(
            sf.col("infos_gerais.*"),
            sf.col('infos_gerais.state').alias('state'))
    )
)
I need the following df output:
sku    category  price  infos_gerais
33344  mmmma     3.00   [{5, 5656655, 5845454, SP}, {5, 6565767, 5854545, MG}, {5, 8788787, 4564646, RS}]
Given that you have an array of structs, you can use transform to process the elements of the array and withField on the structs to add/replace a struct field.
Here's a simple example:
from pyspark.sql import functions as func

data_sdf. \
    withColumn('infos_gerais',
               func.transform('infos_gerais', lambda x: x.withField('state', func.col('state')))
               ). \
    groupBy('sku', 'category', 'price'). \
    agg(func.flatten(func.collect_list('infos_gerais')).alias('infos_gerais')). \
    show(truncate=False)
# +-----+--------+-----+---------------------------------------------------------------------------------+
# |sku |category|price|infos_gerais |
# +-----+--------+-----+---------------------------------------------------------------------------------+
# |33344|mmmma |3.0 |[{5, 5656655, 5845454, SP}, {5, 6565767, 5854545, MG}, {5, 8788787, 4564646, RS}]|
# +-----+--------+-----+---------------------------------------------------------------------------------+
# root
# |-- sku: string (nullable = true)
# |-- category: string (nullable = true)
# |-- price: double (nullable = true)
# |-- infos_gerais: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- service_type_id: integer (nullable = true)
# | | |-- cep_ini: integer (nullable = true)
# | | |-- cep_fim: integer (nullable = true)
# | | |-- state: string (nullable = true)

Spark Bitwise XOR between two DataFrame Columns

I have a Spark DataFrame testDF with the following structure
scala> testDF.printSchema()
-------------------------------------------------
root
|-- id: long (nullable = true)
|-- array1: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- array2: array (nullable = true)
| |-- element: integer (containsNull = true)
Both array1 and array2 are guaranteed to be of the same length.
I want to perform a bitwise XOR between each element of array1 and array2 and sum the results.
Something like:
res = (array1[0] ^ array2[0]) + (array1[1] ^ array2[1]) + ...
I know this can be done using a UDF, but I am wondering if there is a native Spark way to do it.
The desired DataFrame structure would look something like:
root
|-- id: long (nullable = true)
|-- array1: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- array2: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- result: long (nullable = true)
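One UDF-free option, sketched here under the assumption of Spark 3.0+ (where the higher-order functions zip_with and aggregate are exposed in the Scala DataFrame API), is to XOR the arrays element-wise and then fold the result into a sum:

import org.apache.spark.sql.functions._

// XOR corresponding elements of the two arrays, then sum them into the result column.
val resultDF = testDF.withColumn(
  "result",
  aggregate(
    zip_with(col("array1"), col("array2"), (a, b) => a.bitwiseXOR(b)),
    lit(0L),
    (acc, x) => acc + x
  )
)

On Spark 2.4 the same higher-order functions are available only as SQL expressions, e.g. expr("aggregate(zip_with(array1, array2, (x, y) -> x ^ y), 0L, (acc, v) -> acc + v)").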

How to update column values in an array of structs in Spark Scala

root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- Animal: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- Elephant: string (nullable = false)
| | |-- Lion: string (nullable = true)
| | |-- Zebra: string (nullable = true)
| | |-- Dog: string (nullable = true)
I just want to know: is it possible to update the array of structs to some value, given a list of columns that I don't want to update?
For example, if I have a list List[String] = List(Zebra, Dog), is it possible to set all the other struct fields, such as Elephant and Lion, to 0?
+---+----+-----+------+-------+--------------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------------+
|fa1|fa11|fa111|fa1111|fa11111|[[1, 1, 0, 1]]|
|fb1|fb11|fb111|fb1111|fb11111|[[0, 1, 1, 0]]|
+---+----+-----+------+-------+--------------------+
After operations It will be
+---+----+-----+------+-------+--------------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------------+
|fa1|fa11|fa111|fa1111|fa11111|[[0, 0, 0, 1]]|
|fb1|fb11|fb111|fb1111|fb11111|[[0, 0, 1, 0]]|
+---+----+-----+------+-------+--------------------+
I was going to iterate row by row, so I made a function like
def changeValue(row: Row) = {
  //some code
}
but I was not able to make it work.
Check the code below.
scala> ddf.show(false)
+---+----+-----+------+-------+--------------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------------+
|fa1|fa11|fa111|fa1111|fa11111|[[1, 11, 111, 1111]]|
|fb1|fb11|fb111|fb1111|fb11111|[[2, 22, 222, 2222]]|
+---+----+-----+------+-------+--------------------+
scala> val columnsTobeUpdatedInWebhooks = Seq("zebra","dog") // Columns in webhooks to keep unchanged; all others are set to 0.
columnsTobeUpdatedInWebhooks: Seq[String] = List(zebra, dog)
Constructing Expression
val expr = flatten(
  array(
    ddf
      .select(explode($"webhooks").as("webhooks"))
      .select("webhooks.*")
      .columns
      .map(c => if (columnsTobeUpdatedInWebhooks.contains(c)) col(s"webhooks.${c}").as(c) else array(lit(0)).as(c)): _*
  )
)
expr: org.apache.spark.sql.Column = flatten(array(array(0) AS `elephant`, array(0) AS `lion`, webhooks.zebra AS `zebra`, webhooks.dog AS `dog`))
Applying Expression
scala> ddf.withColumn("webhooks",struct(expr)).show(false)
+---+----+-----+------+-------+--------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------+
|fa1|fa11|fa111|fa1111|fa11111|[[0, 0, 0, 1]]|
|fb1|fb11|fb111|fb1111|fb11111|[[0, 0, 1, 0]]|
+---+----+-----+------+-------+--------------+
Final Schema
scala> ddf.withColumn("webhooks", struct(expr)).printSchema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- elephant: integer (nullable = false)
| | |-- lion: integer (nullable = false)
| | |-- zebra: integer (nullable = false)
| | |-- dog: integer (nullable = false)

Convert array of vectors to DenseVector

I am running Spark 2.1 with Scala. I am trying to convert an array of vectors into a DenseVector.
Here is my dataframe:
scala> df_transformed.printSchema()
root
|-- id: long (nullable = true)
|-- vals: vector (nullable = true)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
scala> df_transformed.show()
+------------+--------------------+--------------------+
| id| vals| hashValues|
+------------+--------------------+--------------------+
|401310732094|[-0.37154,-0.1159...|[[-949518.0], [47...|
|292125586474|[-0.30407,0.35437...|[[-764013.0], [31...|
|362051108485|[-0.36748,0.05738...|[[-688834.0], [18...|
|222480119030|[-0.2509,0.55574,...|[[-1167047.0], [2...|
|182270925238|[0.32288,-0.60789...|[[-836660.0], [97...|
+------------+--------------------+--------------------+
For example, I need to extract the value of the hashValues column into a DenseVector for id 401310732094.
This can be done with a UDF:
import spark.implicits._
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

val convertToVec = udf((array: Seq[Vector]) =>
  Vectors.dense(array.flatMap(_.toArray).toArray)
)
val df = df_transformed.withColumn("hashValues", convertToVec($"hashValues"))
This will overwrite the hashValues column with a new one containing a DenseVector.
Tested with a dataframe with the following schema:
root
|-- id: integer (nullable = false)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
The result is:
root
|-- id: integer (nullable = false)
|-- hashValues: vector (nullable = true)

How to join two dataframes when the key used for joining has a different datatype in each dataframe

I have two dataframes df1 and df2 whose schemas are as follows:
DF1 is of the form:
slotSize2: struct (nullable = true)
| |-- 120x600: struct (nullable = true)
| | |-- pView: string (nullable = true)
| |-- 160x600: struct (nullable = true)
| | |-- level: string (nullable = true)
| | |-- pATF: string (nullable = true)
| | |-- pView: string (nullable = true)
| | |-- pViewV1: string (nullable = true)
| | |-- sPos: string (nullable = true)
| |-- 250x250: struct (nullable = true)
| | |-- pView: string (nullable = true)
| |-- 300x250: struct (nullable = true)
| | |-- level: string (nullable = true)
| | |-- pATF: string (nullable = true)
| | |-- pView: string (nullable = true)
| | |-- pViewV1: string (nullable = true)
| | |-- sPos: string (nullable = true)
Dataframe df2 has the schema:
root
|-- bidId: array (nullable = true)
| |-- element: string (containsNull = true)
|-- slotSize1: array (nullable = true)
| |-- element: string (containsNull = true)
In dataframe df2 we have the slot size (named slotSize1) as a string, while in dataframe df1 we have the nested form of the slot size, i.e. for every slot size there is a corresponding map.
I want to join the two dataframes df1 and df2 to form a new dataframe df3 with schema (bidId, slotSize, viewMap), where bidId comes from df2, slotSize is of the form 120x600 and is present in both schemas, and viewMap corresponds to the nested map for each slot size in df1.
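A sketch of one possible approach, under stated assumptions: bidId and slotSize1 in df2 are parallel arrays that should be paired element-wise, and because the per-slot structs in df1 have different shapes, the viewMap is serialized to a JSON string so that all slot sizes share a common type. The column names follow the schemas above; adjust as needed.

import org.apache.spark.sql.functions._

// Explode df2 into one row per (bidId, slotSize) pair, assuming the two arrays are parallel.
val df2Flat = df2
  .select(col("bidId"), posexplode(col("slotSize1")).as(Seq("pos", "slotSize")))
  .select(expr("bidId[pos]").as("bidId"), col("slotSize"))

// Unpivot df1's slotSize2 struct into one row per slot size.
// to_json is used because the nested structs differ from one slot size to another.
val slotCols = df1.select("slotSize2.*").columns
val df1Flat = slotCols
  .map(c => df1.select(lit(c).as("slotSize"), to_json(col(s"slotSize2.`$c`")).as("viewMap")))
  .reduce(_ union _)

// Join on the slot-size string to get (bidId, slotSize, viewMap).
val df3 = df2Flat.join(df1Flat, Seq("slotSize"))

If the original nested structure is needed rather than a JSON string, viewMap can be parsed back with from_json and a per-slot schema.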