I have the Dataset in Spark with these schemas:
root
|-- from: struct (nullable = false)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- tags: string (nullable = true)
|-- v1: struct (nullable = false)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- tags: string (nullable = true)
|-- v2: struct (nullable = false)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- tags: string (nullable = true)
|-- v3: struct (nullable = false)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- tags: string (nullable = true)
|-- to: struct (nullable = false)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- tags: string (nullable = true)
How can I make a table (with only the 3 columns id, name, tags) from this Dataset in Scala?
Just combine all the columns into an array, explode it, and select all the nested fields:
import org.apache.spark.sql.functions.{array, col, explode}
import spark.implicits._

case class Vertex(id: String, name: String, tags: String)

val df = Seq((
  Vertex("1", "from", "a"), Vertex("2", "V1", "b"), Vertex("3", "V2", "c"),
  Vertex("4", "v3", "d"), Vertex("5", "to", "e")
)).toDF("from", "v1", "v2", "v3", "to")

df.select(explode(array(df.columns.map(col): _*)).alias("col")).select("col.*")
with the result as follows:
+---+----+----+
| id|name|tags|
+---+----+----+
| 1|from| a|
| 2| V1| b|
| 3| V2| c|
| 4| v3| d|
| 5| to| e|
+---+----+----+
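This works because all five struct columns share exactly the same struct type, so array() can combine them and a single explode turns the five structs of each input row into five output rows.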
We have a dataFrame that looks like:
root
|-- id: string (nullable = true)
|-- key1_suffix1: string (nullable = true)
|-- key2_suffix1: string (nullable = true)
|-- suffix1: string (nullable = true)
|-- key1_suffix2: string (nullable = true)
|-- key2_suffix2: string (nullable = true)
|-- suffix2: string (nullable = true)
How can we convert this into another dataframe like this:
root
|-- id: string (nullable = true)
|-- tags: struct (nullable = true)
| |-- suffix1: struct (nullable = true)
| | |-- key1_suffix1: string (nullable = true)
| | |-- key2_suffix1: string (nullable = true)
| | |-- suffix1: string (nullable = true)
| |-- suffix2: struct (nullable = true)
| | |-- key1_suffix2: string (nullable = true)
| | |-- key2_suffix2: string (nullable = true)
| | |-- suffix2: string (nullable = true)
The input array of suffixes is already given, for example inputSuffix = ["suffix1", "suffix2"].
This is needed in Spark Scala code, with Spark 3.1 and Scala 2.12.
You can use the struct() function to group columns into one nested column:
// test data
import spark.implicits._
import org.apache.spark.sql.functions.struct

val df = Seq(
  ("1", "a", "b", "c", "d", "e", "f"),
  ("2", "aa", "bb", "cc", "dd", "ee", "ff")
).toDF("id", "key1_suffix1", "key2_suffix1", "suffix1", "key1_suffix2", "key2_suffix2", "suffix2")

// Processing
val res = df
  .withColumn("tags", struct(
    struct("key1_suffix1", "key2_suffix1", "suffix1").as("suffix1"),
    struct("key1_suffix2", "key2_suffix2", "suffix2").as("suffix2")
  ))
  .drop("key1_suffix1", "key2_suffix1", "suffix1", "key1_suffix2", "key2_suffix2", "suffix2")

res.printSchema()
root
|-- id: string (nullable = true)
|-- tags: struct (nullable = false)
| |-- suffix1: struct (nullable = false)
| | |-- key1_suffix1: string (nullable = true)
| | |-- key2_suffix1: string (nullable = true)
| | |-- suffix1: string (nullable = true)
| |-- suffix2: struct (nullable = false)
| | |-- key1_suffix2: string (nullable = true)
| | |-- key2_suffix2: string (nullable = true)
| | |-- suffix2: string (nullable = true)
UPDATE
This can be done dynamically using a list of columns; if a value in the list has no matching columns in the dataframe, filter it out so you don't get errors:
val inputSuffix = Array("suffix1", "suffix2", "suffix3")

// keep only the suffixes whose three columns all exist in the dataframe
val inputSuffixFiltered = inputSuffix.filter(c =>
  df.columns.contains(s"key1_$c") && df.columns.contains(s"key2_$c") && df.columns.contains(c))

val tagsCol = inputSuffixFiltered.map(c => struct(s"key1_$c", s"key2_$c", c).as(c))
val colsToDelete = inputSuffixFiltered.flatMap(c => Seq(s"key1_$c", s"key2_$c", c))

val res = df.withColumn("tags", struct(tagsCol: _*)).drop(colsToDelete: _*)
res.printSchema()
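With the test data above, suffix3 has no matching columns, so it is filtered out and the printed schema is the same as the one shown earlier.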
I have a df_have with this schema:
root
|-- events: struct (nullable = true)
| |-- eventA: boolean (nullable = true)
| |-- eventB: boolean (nullable = true)
| |-- eventC: boolean (nullable = true)
| |-- eventD: boolean (nullable = true)
|-- id: long
And would like to end up with df_want, where the array contains only events that are True:
root
|-- events_array: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- eventName: string (nullable = true)
| | |-- eventValue: integer (nullable = true)
|-- id: long
For example, if a row in df_have looks like:
+--------------------------+--------+
|events                    |id      |
+--------------------------+--------+
|{false, true, false, true}|12345678|
+--------------------------+--------+
I'd like df_want to look like:
+--------------------------+--------+
|events_array              |id      |
+--------------------------+--------+
|[{eventB, 1}, {eventD, 1}]|12345678|
+--------------------------+--------+
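A minimal sketch of one way to get there, assuming Spark 3.x (for the filter higher-order function) and that the event names are known up front:
import org.apache.spark.sql.functions.{array, col, filter, lit, struct, when}

// One struct per event; eventValue is 1 when the flag is true, null otherwise.
val eventCols = Seq("eventA", "eventB", "eventC", "eventD")
val entries = eventCols.map { c =>
  struct(lit(c).as("eventName"),
         when(col(s"events.$c"), lit(1)).as("eventValue"))
}

// Keep only the entries whose flag was true, then drop the original struct.
val df_want = df_have
  .withColumn("events_array", filter(array(entries: _*), e => e("eventValue").isNotNull))
  .select("events_array", "id")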
I have dataframe like below
id contact_persons
-----------------------
1 [[abc, abc@xyz.com, 896676, manager],[pqr, pqr@xyz.com, 89809043, director],[stu, stu@xyz.com, 09909343, programmer]]
schema looks like this.
root
|-- id: string (nullable = true)
|-- contact_persons: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
I need to convert this dataframe to the schema below.
root
|-- id: string (nullable = true)
|-- contact_persons: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- emails: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- phone: string (nullable = true)
| | |-- roles: string (nullable = true)
I know there is a struct function in PySpark, but in this scenario I don't know how to use it, as the array is dynamically sized.
You can use the TRANSFORM expression to cast it:
import pyspark.sql.functions as f

df = spark.createDataFrame([
    [1, [['abc', 'abc@xyz.com', '896676', 'manager'],
         ['pqr', 'pqr@xyz.com', '89809043', 'director'],
         ['stu', 'stu@xyz.com', '09909343', 'programmer']]]
], schema='id string, contact_persons array<array<string>>')

expression = 'TRANSFORM(contact_persons, el -> STRUCT(el[0] AS name, el[1] AS emails, el[2] AS phone, el[3] AS roles))'
output_df = df.withColumn('contact_persons', f.expr(expression))
# output_df.printSchema()
# root
# |-- id: string (nullable = true)
# |-- contact_persons: array (nullable = true)
# | |-- element: struct (containsNull = false)
# | | |-- name: string (nullable = true)
# | | |-- emails: string (nullable = true)
# | | |-- phone: string (nullable = true)
# | | |-- roles: string (nullable = true)
output_df.show(truncate=False)
+---+-----------------------------------------------------------------------------------------------------------------------+
|id |contact_persons |
+---+-----------------------------------------------------------------------------------------------------------------------+
|1  |[{abc, abc@xyz.com, 896676, manager}, {pqr, pqr@xyz.com, 89809043, director}, {stu, stu@xyz.com, 09909343, programmer}]|
+---+-----------------------------------------------------------------------------------------------------------------------+
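Since the TRANSFORM expression is Spark SQL rather than Python, the same approach works unchanged from Scala (the language used elsewhere in this thread); a sketch, assuming a dataframe with the same shape:
import org.apache.spark.sql.functions.expr

// Same SQL expression, run through the Scala API (a sketch, not from the original answer)
val expression =
  "TRANSFORM(contact_persons, el -> STRUCT(el[0] AS name, el[1] AS emails, el[2] AS phone, el[3] AS roles))"
val outputDf = df.withColumn("contact_persons", expr(expression))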
I have the below dataframe schema as df.currentSchema and need to obtain df.expectedSchema. Is there a way I can achieve this in Spark 2.3?
df.currentSchema:
|-- enqueuedTime: timestamp (nullable = true)
|-- VIN: string (nullable = true)
|-- TT: long (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- ADA: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- ADW: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
df.expectedSchema:
|-- enqueuedTime: timestamp (nullable = true)
|-- VIN: string (nullable = true)
|-- TT: long (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- SIGNAL: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- SN: string (nullable = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
| | |-- SN: string (nullable = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
| | |-- SN: string (nullable = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
Sample data:
+-----------------+---+--------+------------------------------------------------+--------------------------+
|vin              |tt |msg_type|ada                                             |adw                       |
+-----------------+---+--------+------------------------------------------------+--------------------------+
|FU7XXXXXXXXXXXXXX|0  |SIGNAL  |[{"E":15XXXXXXXX,"V":2}, {"E":15XXXXXXXX,"V":1}]|null                      |
|FU7XXXXXXXXXXXXXX|0  |SIGNAL  |null                                            |[{"E":15XXXXXXXX,"V":3}]  |
|FU7XXXXXXXXXXXXXX|0  |SIGNAL  |null                                            |[{"E":15XXXXXXXX,"V":4.1}]|
+-----------------+---+--------+------------------------------------------------+--------------------------+
Note: two things need to be achieved here:
A new field SN is to be created for each (E, V) pair within an element, and its value should be the name of the source array. For example, for the first array column (ADA), SN = ADA.
Merge the arrays (ADA, ADW) into one single outer array (SIGNAL).
The schema you are looking for is incorrect and might fail when you write the dataframe.
I tweaked the schema as below:
scala> newDF.printSchema
root
|-- ADA: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: string (nullable = true)
| | |-- V: long (nullable = true)
|-- ADW: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: string (nullable = true)
| | |-- V: long (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- number: long (nullable = true)
|-- tt: long (nullable = true)
|-- vin: string (nullable = true)
|-- sig: struct (nullable = false)
| |-- SN: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- E: string (nullable = true)
| | | |-- V: long (nullable = true)
| |-- SN: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- E: string (nullable = true)
| | | |-- V: long (nullable = true)
If you are fine with this schema, read on to see how to achieve it.
I created dummy data to replicate your schema (this step can be skipped):
scala> val vas = """{"df":[ { "vin": "FU7XXXXXXXXXXXXXX", "tt": 0, "MSG_TYPE": "SIGNAL", "number": 123, "ADA": [{"E":"15XXXXXXXX","V":2}, {"E":"15XXXXXXXX","V":1}], "ADW": [{"E":"15XXXXXXXX","V":2}, {"E":"15XXXXXXXX","V":1}] }, { "vin": "FU7XXXXXXXXXXXXXX", "tt": 0, "MSG_TYPE": "SIGNAL", "number": 123, "ADA": [{"E":"15XXXXXXXX","V":2}, {"E":"15XXXXXXXX","V":1}], "ADW": [{"E":"15XXXXXXXX","V":2}, {"E":"15XXXXXXXX","V":1}] }, { "vin": "FU7XXXXXXXXXXXXXX", "tt": 0, "MSG_TYPE": "SIGNAL", "number": 123, "ADA": [{"E":"15XXXXXXXX","V":2}, {"E":"15XXXXXXXX","V":1}], "ADW":[{"E":"15XXXXXXXX","V":2}, {"E":"15XXXXXXXX","V":1}] }] }"""
vas: String = {"df":[ { "vin": "FU7XXXXXXXXXXXXXX", "tt": 0, "MSG_TYPE": "SIGNAL", "number": 123, "ADA": [{"E":"15XXXXXXXX","V":2}, {"E":"15XXXXXXXX","V":1}], "ADW": [{"E":"15XXXXXXXX","V":2}, {"E":"15XXXXXXXX","V":1}] }, { "vin": "FU7XXXXXXXXXXXXXX", "tt": 0, "MSG_TYPE": "SIGNAL", "number": 123, "ADA": [{"E":"15XXXXXXXX","V":2}, {"E":"15XXXXXXXX","V":1}], "ADW": [{"E":"15XXXXXXXX","V":2}, {"E":"15XXXXXXXX","V":1}] }, { "vin": "FU7XXXXXXXXXXXXXX", "tt": 0, "MSG_TYPE": "SIGNAL", "number": 123, "ADA": [{"E":"15XXXXXXXX","V":2}, {"E":"15XXXXXXXX","V":1}], "ADW":[{"E":"15XXXXXXXX","V":2}, {"E":"15XXXXXXXX","V":1}] }] }
scala> val df = spark.read.json(Seq(vas).toDS).toDF.withColumn("arr", explode($"df")).select("arr.*")
df: org.apache.spark.sql.DataFrame = [ADA: array<struct<E:string,V:bigint>>, ADW: array<struct<E:string,V:bigint>> ... 4 more fields]
I hope this is how your data looks:
scala> df.show(false)
+--------------------------------+--------------------------------+--------+------+---+-----------------+
|ADA |ADW |MSG_TYPE|number|tt |vin |
+--------------------------------+--------------------------------+--------+------+---+-----------------+
|[[15XXXXXXXX,2], [15XXXXXXXX,1]]|[[15XXXXXXXX,2], [15XXXXXXXX,1]]|SIGNAL |123 |0 |FU7XXXXXXXXXXXXXX|
|[[15XXXXXXXX,2], [15XXXXXXXX,1]]|[[15XXXXXXXX,2], [15XXXXXXXX,1]]|SIGNAL |123 |0 |FU7XXXXXXXXXXXXXX|
|[[15XXXXXXXX,2], [15XXXXXXXX,1]]|[[15XXXXXXXX,2], [15XXXXXXXX,1]]|SIGNAL |123 |0 |FU7XXXXXXXXXXXXXX|
+--------------------------------+--------------------------------+--------+------+---+-----------------+
Steps to achieve the required output:
scala> val newDF = df.withColumn("sig", struct($"ADA".as("SN"), $"ADW".as("SN")))
newDF: org.apache.spark.sql.DataFrame = [ADA: array<struct<E:string,V:bigint>>, ADW: array<struct<E:string,V:bigint>> ... 5 more fields]
scala> newDF.printSchema
root
|-- ADA: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: string (nullable = true)
| | |-- V: long (nullable = true)
|-- ADW: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: string (nullable = true)
| | |-- V: long (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- number: long (nullable = true)
|-- tt: long (nullable = true)
|-- vin: string (nullable = true)
|-- sig: struct (nullable = false)
| |-- SN: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- E: string (nullable = true)
| | | |-- V: long (nullable = true)
| |-- SN: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- E: string (nullable = true)
| | | |-- V: long (nullable = true)
I tried to write this dataframe and it works fine:
newDF.write.mode("overwrite").parquet(path + "newDF.parquet")
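If you need the literal ask instead (one outer SIGNAL array whose elements carry an SN tag naming their source array), here is a sketch that stays within Spark 2.3, where higher-order functions such as TRANSFORM are not yet available; it assumes (vin, tt, MSG_TYPE, number) uniquely keys a row:
import org.apache.spark.sql.functions.{col, collect_list, explode_outer, lit, struct}

val keys = Seq("vin", "tt", "MSG_TYPE", "number")

// Explode each source array and tag every element with its array name as SN ...
val tagged = Seq("ADA", "ADW").map { name =>
  df.select((keys.map(col) :+ explode_outer(col(name)).as("el")): _*)
    .where(col("el").isNotNull)
    .withColumn("signal", struct(lit(name).as("SN"), col("el.E"), col("el.V")))
    .drop("el")
}.reduce(_ union _)

// ... then collect everything back into a single SIGNAL array per row.
val result = tagged.groupBy(keys.map(col): _*).agg(collect_list("signal").as("SIGNAL"))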
I have a DataFrame with the following schema
root
|-- col_a: string (nullable = false)
|-- col_b: string (nullable = false)
|-- col_c_a: string (nullable = false)
|-- col_c_b: string (nullable = false)
|-- col_d: string (nullable = false)
|-- col_e: string (nullable = false)
|-- col_f: string (nullable = false)
Now I want to convert the schema of this data frame to something like this:
root
|-- col_a: string (nullable = false)
|-- col_b: string (nullable = false)
|-- col_c: struct (nullable = false)
| |-- col_c_a: string (nullable = false)
| |-- col_c_b: string (nullable = false)
|-- col_d: string (nullable = false)
|-- col_e: string (nullable = false)
|-- col_f: string (nullable = false)
I am able to do this with a map transformation by explicitly fetching the value of each column from the Row type, but that is a complex process and does not look good. So,
is there any way I can achieve this?
Thanks
There is a built-in struct function with the definition:
def struct(cols: Column*): Column
You can use it like this:
df.show
+---+---+
| a| b|
+---+---+
| 1| 2|
| 2| 3|
+---+---+
df.withColumn("struct_col", struct($"a", $"b")).show
+---+---+----------+
| a| b|struct_col|
+---+---+----------+
| 1| 2| [1,2]|
| 2| 3| [2,3]|
+---+---+----------+
The schema of the new dataframe is:
|-- a: integer (nullable = false)
|-- b: integer (nullable = false)
|-- struct_col: struct (nullable = false)
| |-- a: integer (nullable = false)
| |-- b: integer (nullable = false)
In your case, you can do something like:
df.withColumn("col_c", struct($"col_c_a", $"col_c_b")).drop($"col_c_a").drop($"col_c_b")