Schema changes when writing Dataframe containing Vector - scala

I am writing a Spark dataframe, in which one of the columns is of Vector datatype, to ORC. When I load the dataframe back, the schema changes.
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SaveMode}

var df: DataFrame = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")
df.printSchema
df.write.mode(SaveMode.Overwrite).orc("/some/path")
val newDF = spark.read.orc("/some/path")
newDF.printSchema
The output of df.printSchema is
|-- label: double (nullable = false)
|-- features: vector (nullable = true)
The output of newDF.printSchema is
|-- label: double (nullable = true)
|-- features: struct (nullable = true)
| |-- type: byte (nullable = true)
| |-- size: integer (nullable = true)
| |-- indices: array (nullable = true)
| | |-- element: integer (containsNull = true)
| |-- values: array (nullable = true)
| | |-- element: double (containsNull = true)
What is the issue here? I am using Spark 2.2.0 with Scala 2.11.8.
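The likely cause is that the ML Vector type is a user-defined type (VectorUDT) whose underlying representation is exactly this struct (type, size, indices, values); the ORC writer in Spark 2.2 does not preserve the UDT metadata, so the column is read back as the raw struct. A minimal sketch of a workaround, assuming that layout (type 1 = dense, 0 = sparse), is to rebuild the vector with a UDF after reading:

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// Rebuild an ml Vector from the struct that comes back from ORC.
// Assumed field order: type (byte), size (int), indices (array<int>), values (array<double>).
val toVector = udf { row: Row =>
  if (row.getByte(0) == 1) {
    Vectors.dense(row.getSeq[Double](3).toArray)                // dense vector
  } else {
    Vectors.sparse(row.getInt(1), row.getSeq[Int](2).toArray,   // sparse vector
      row.getSeq[Double](3).toArray)
  }
}

val restoredDF = newDF.withColumn("features", toVector(col("features")))
restoredDF.printSchema  // features is reported as vector again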

Related

Add a field that already exists in the pyspark df to a struct field

I have the following df:
sku    category  price  state  infos_gerais
33344  mmmma     3.00   SP     [{5, 5656655, 5845454}]
33344  mmmma     3.00   MG     [{5, 6565767, 5854545}]
33344  mmmma     3.00   RS     [{5, 8788787, 4564646}]
The schema of the df follows:
|-- sku: string (nullable = true)
|-- category: string (nullable = true)
|-- price: double (nullable = true)
|-- state: string (nullable = true)
|-- infos_gerais: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- service_type_id: integer (nullable = true)
| | |-- cep_ini: integer (nullable = true)
| | |-- cep_fim: integer (nullable = true)
Note that the field that doesn't repeat in the df is 'state', so I need to insert this field into the struct 'infos_gerais' and apply a groupBy. I tried the code below, but it returns an error. Can anyone help me?
df_end = df_end.withColumn(
    "infos_gerais",
    sf.collect_list(
        sf.struct(
            sf.col("infos_gerais.*"),
            sf.col('infos_gerais.state').alias('state'))
    )
)
I need the following df output:
sku    category  price  infos_gerais
33344  mmmma     3.00   [{5, 5656655, 5845454, SP}, {5, 6565767, 5854545, MG}, {5, 8788787, 4564646, RS}]
Given that you have an array of structs, you can use transform to process the elements of the array and withField on the structs to add/replace a struct field.
Here's a simple example:
from pyspark.sql import functions as func

data_sdf. \
    withColumn('infos_gerais',
               func.transform('infos_gerais', lambda x: x.withField('state', func.col('state')))
               ). \
    groupBy('sku', 'category', 'price'). \
    agg(func.flatten(func.collect_list('infos_gerais')).alias('infos_gerais')). \
    show(truncate=False)
# +-----+--------+-----+---------------------------------------------------------------------------------+
# |sku |category|price|infos_gerais |
# +-----+--------+-----+---------------------------------------------------------------------------------+
# |33344|mmmma |3.0 |[{5, 5656655, 5845454, SP}, {5, 6565767, 5854545, MG}, {5, 8788787, 4564646, RS}]|
# +-----+--------+-----+---------------------------------------------------------------------------------+
# root
# |-- sku: string (nullable = true)
# |-- category: string (nullable = true)
# |-- price: double (nullable = true)
# |-- infos_gerais: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- service_type_id: integer (nullable = true)
# | | |-- cep_ini: integer (nullable = true)
# | | |-- cep_fim: integer (nullable = true)
# | | |-- state: string (nullable = true)

Spark Bitwise XOR between two DataFrame Columns

I have a Spark DataFrame testDF with the following structure
scala> testDF.printSchema()
-------------------------------------------------
root
|-- id: long (nullable = true)
|-- array1: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- array2: array (nullable = true)
| |-- element: integer (containsNull = true)
Both array1 and array2 are guaranteed to be of the same length.
I want to perform a bitwise XOR between each element of array1 and array2 and sum the results.
Something like:
res = (array1[0] ^ array2[0]) + (array1[1] ^ array2[1]) + ...
I know this can be done using a UDF, but I am wondering if there is a native Spark way to do it.
The desired DataFrame structure would look something like:
root
|-- id: long (nullable = true)
|-- array1: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- array2: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- result: long (nullable = true)
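One UDF-free option, sketched here under the assumption of Spark 3.0+ (where the higher-order functions zip_with and aggregate are exposed in the Scala DataFrame API), is to XOR the arrays element-wise and then fold the result into a sum:

import org.apache.spark.sql.functions._

// XOR corresponding elements of the two arrays, then sum them into the result column.
val resultDF = testDF.withColumn(
  "result",
  aggregate(
    zip_with(col("array1"), col("array2"), (a, b) => a.bitwiseXOR(b)),
    lit(0L),
    (acc, x) => acc + x
  )
)

On Spark 2.4 the same higher-order functions are available only as SQL expressions, e.g. expr("aggregate(zip_with(array1, array2, (x, y) -> x ^ y), 0L, (acc, v) -> acc + v)").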

How to update column values in an array of structs in Spark Scala

root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- Animal: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- Elephant: string (nullable = false)
| | |-- Lion: string (nullable = true)
| | |-- Zebra: string (nullable = true)
| | |-- Dog: string (nullable = true)
I just want to know: is it possible to update the array of structs to some value, given a list of columns that I don't want to update?
For example, if I have a list List[String] = List(Zebra, Dog), is it possible to set all the other struct fields, such as Elephant and Lion, to 0?
+---+----+-----+------+-------+--------------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------------+
|fa1|fa11|fa111|fa1111|fa11111|[[1, 1, 0, 1]]|
|fb1|fb11|fb111|fb1111|fb11111|[[0, 1, 1, 0]]|
+---+----+-----+------+-------+--------------------+
After operations It will be
+---+----+-----+------+-------+--------------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------------+
|fa1|fa11|fa111|fa1111|fa11111|[[0, 0, 0, 1]]|
|fb1|fb11|fb111|fb1111|fb11111|[[0, 0, 1, 0]]|
+---+----+-----+------+-------+--------------------+
I was going to iterate row by row, so I made a function like
def changeValue(row: Row) = {
  //some code
}
but I was not able to make it work.
Check the code below.
scala> ddf.show(false)
+---+----+-----+------+-------+--------------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------------+
|fa1|fa11|fa111|fa1111|fa11111|[[1, 11, 111, 1111]]|
|fb1|fb11|fb111|fb1111|fb11111|[[2, 22, 222, 2222]]|
+---+----+-----+------+-------+--------------------+
scala> val columnsTobeUpdatedInWebhooks = Seq("zebra","dog") // Columns in webhooks to keep unchanged; all others are set to 0.
columnsTobeUpdatedInWebhooks: Seq[String] = List(zebra, dog)
Constructing Expression
val expr = flatten(
  array(
    ddf
      .select(explode($"webhooks").as("webhooks"))
      .select("webhooks.*")
      .columns
      .map(c => if (columnsTobeUpdatedInWebhooks.contains(c)) col(s"webhooks.${c}").as(c) else array(lit(0)).as(c)): _*
  )
)
expr: org.apache.spark.sql.Column = flatten(array(array(0) AS `elephant`, array(0) AS `lion`, webhooks.zebra AS `zebra`, webhooks.dog AS `dog`))
Applying Expression
scala> ddf.withColumn("webhooks",struct(expr)).show(false)
+---+----+-----+------+-------+--------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------+
|fa1|fa11|fa111|fa1111|fa11111|[[0, 0, 0, 1]]|
|fb1|fb11|fb111|fb1111|fb11111|[[0, 0, 1, 0]]|
+---+----+-----+------+-------+--------------+
Final Schema
scala> ddf.withColumn("webhooks", struct(expr)).printSchema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- elephant: integer (nullable = false)
| | |-- lion: integer (nullable = false)
| | |-- zebra: integer (nullable = false)
| | |-- dog: integer (nullable = false)

Convert array of vectors to DenseVector

I am running Spark 2.1 with Scala. I am trying to convert an array of vectors into a DenseVector.
Here is my dataframe:
scala> df_transformed.printSchema()
root
|-- id: long (nullable = true)
|-- vals: vector (nullable = true)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
scala> df_transformed.show()
+------------+--------------------+--------------------+
| id| vals| hashValues|
+------------+--------------------+--------------------+
|401310732094|[-0.37154,-0.1159...|[[-949518.0], [47...|
|292125586474|[-0.30407,0.35437...|[[-764013.0], [31...|
|362051108485|[-0.36748,0.05738...|[[-688834.0], [18...|
|222480119030|[-0.2509,0.55574,...|[[-1167047.0], [2...|
|182270925238|[0.32288,-0.60789...|[[-836660.0], [97...|
+------------+--------------------+--------------------+
For example, I need to extract the value of the hashValues column into a DenseVector for id 401310732094.
This can be done with a UDF:
import spark.implicits._
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

val convertToVec = udf((array: Seq[Vector]) =>
  Vectors.dense(array.flatMap(_.toArray).toArray)
)
val df = df_transformed.withColumn("hashValues", convertToVec($"hashValues"))
This will overwrite the hashValues column with a new one containing a DenseVector.
Tested with a dataframe with the following schema:
root
|-- id: integer (nullable = false)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
The result is:
root
|-- id: integer (nullable = false)
|-- hashValues: vector (nullable = true)

How to join two dataframes when the key used for joining has a different datatype in each dataframe

I have two dataframes df1 and df2 whose schemas are as follows:
DF1 is of the form:
slotSize2: struct (nullable = true)
| |-- 120x600: struct (nullable = true)
| | |-- pView: string (nullable = true)
| |-- 160x600: struct (nullable = true)
| | |-- level: string (nullable = true)
| | |-- pATF: string (nullable = true)
| | |-- pView: string (nullable = true)
| | |-- pViewV1: string (nullable = true)
| | |-- sPos: string (nullable = true)
| |-- 250x250: struct (nullable = true)
| | |-- pView: string (nullable = true)
| |-- 300x250: struct (nullable = true)
| | |-- level: string (nullable = true)
| | |-- pATF: string (nullable = true)
| | |-- pView: string (nullable = true)
| | |-- pViewV1: string (nullable = true)
| | |-- sPos: string (nullable = true)
Dataframe df2 has the schema:
root
|-- bidId: array (nullable = true)
| |-- element: string (containsNull = true)
|-- slotSize1: array (nullable = true)
| |-- element: string (containsNull = true)
In dataframe df2 we have the slot size (named slotSize1) as a string, while in dataframe df1 we have the nested form of the slot size, i.e. for every slot size there is a corresponding map.
I want to join the two dataframes df1 and df2 to form a new dataframe df3 with schema (bidId, slotSize, viewMap), where bidId comes from df2, slotSize is of the form 120x600 and is present in both schemas, and viewMap corresponds to the nested map for each slot size in df1.
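A sketch of one possible approach, under stated assumptions: bidId and slotSize1 in df2 are parallel arrays that should be paired element-wise, and because the per-slot structs in df1 have different shapes, the viewMap is serialized to a JSON string so that all slot sizes share a common type. The column names follow the schemas above; adjust as needed.

import org.apache.spark.sql.functions._

// Explode df2 into one row per (bidId, slotSize) pair, assuming the two arrays are parallel.
val df2Flat = df2
  .select(col("bidId"), posexplode(col("slotSize1")).as(Seq("pos", "slotSize")))
  .select(expr("bidId[pos]").as("bidId"), col("slotSize"))

// Unpivot df1's slotSize2 struct into one row per slot size.
// to_json is used because the nested structs differ from one slot size to another.
val slotCols = df1.select("slotSize2.*").columns
val df1Flat = slotCols
  .map(c => df1.select(lit(c).as("slotSize"), to_json(col(s"slotSize2.`$c`")).as("viewMap")))
  .reduce(_ union _)

// Join on the slot-size string to get (bidId, slotSize, viewMap).
val df3 = df2Flat.join(df1Flat, Seq("slotSize"))

If the original nested structure is needed rather than a JSON string, viewMap can be parsed back with from_json and a per-slot schema.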