transform dataset into case class via (wrapped) encoders - scala

I am new to Scala. Excuse my lack of knowledge.
This is my dataset:
val bfDS = sessions.select("bf")
sessions.select("bf").printSchema
|-- bf: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- s: struct (nullable = true)
| | | |-- a: string (nullable = true)
| | | |-- b: string (nullable = true)
| | | |-- c: string (nullable = true)
| | |-- a: struct (nullable = true)
| | | |-- a: integer (nullable = true)
| | | |-- b: long (nullable = true)
| | | |-- c: integer (nullable = true)
| | | |-- d: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- a: string (nullable = true)
| | | | | |-- b: integer (nullable = true)
| | | | | |-- c: long (nullable = true)
| | |-- tr: struct (nullable = true)
| | | |-- a: integer (nullable = true)
| | | |-- b: long (nullable = true)
| | | |-- c: integer (nullable = true)
| | | |-- d: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- e: string (nullable = true)
| | | | | |-- f: integer (nullable = true)
| | | | | |-- g: long (nullable = true)
| | |-- cs: struct (nullable = true)
| | | |-- a: integer (nullable = true)
| | | |-- b: long (nullable = true)
| | | |-- c: integer (nullable = true)
| | | |-- d: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- e: string (nullable = true)
| | | | | |-- f: integer (nullable = true)
| | | | | |-- g: long (nullable = true)
1) I don't think I understand Scala datasets very well. A dataset is composed of rows, but when I print the schema, it shows an array. How does the dataset map to the array? Is each row an element in the array?
2) I want to convert my dataset into a case class.
case class Features( s: Iterable[CustomType], a: Iterable[CustomType], tr: Iterable[CustomType], cs: Iterable[CustomType])
How do I convert my dataset and how do I use encoders?
Many thanks.

Welcome to Stack Overflow. Sadly this question is too broad for SO; take a look at "How to Ask" to improve this and future questions.
However, I will try to answer a few of your questions.
First, Spark Rows can encode a variety of values, including arrays and structures.
Second, your dataframe's rows contain only one column, of type Array[...].
Third, if you want to create a Dataset from your df, your case class must match your schema (including the field name - the single column here is called bf), in which case it should be something like:
case class Features(bf: Array[Elements])
case class Elements(s: CustomType, a: CustomType, tr: CustomType, cs: CustomType)
Finally, an Encoder is used to transform your case classes and their values into Spark's internal representation. You shouldn't worry too much about them yet - you just need to import spark.implicits._ and all the encoders you need will be there automatically.
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
val ds: Dataset[Features] = df.as[Features]
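For a fuller picture, here is a hedged sketch that spells the case classes out against the schema printed in the question. The field names and types are read off that printout, so verify them against your real data before relying on them; sessions is the DataFrame from the question.
import org.apache.spark.sql.{Dataset, SparkSession}

// Case classes mirroring the printed schema (a sketch - adjust to your real data).
case class SType(a: String, b: String, c: String)
case class ADElem(a: String, b: Option[Int], c: Option[Long])
case class AType(a: Option[Int], b: Option[Long], c: Option[Int], d: Seq[ADElem])
case class TrDElem(e: String, f: Option[Int], g: Option[Long])
case class TrType(a: Option[Int], b: Option[Long], c: Option[Int], d: Seq[TrDElem])
case class Elements(s: SType, a: AType, tr: TrType, cs: TrType)
case class Features(bf: Seq[Elements]) // field name must match the column name "bf"

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// sessions is the DataFrame from the question
val typed: Dataset[Features] = sessions.select("bf").as[Features]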
Also, you should take a look at this as a reference.

Related

Update a highly nested column from string to struct

|-- x: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- y: long (nullable = true)
| | |-- z: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- log: string (nullable = true)
I have the above nested schema where I want to change column z's log from string to struct.
|-- x: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- y: long (nullable = true)
| | |-- z: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- log: struct (nullable = true)
| | | | | |-- b: string (nullable = true)
| | | | | |-- c: string (nullable = true)
I'm not using Spark 3 but Spark 2.4.x. I'd prefer the Scala way, but Python works too since this is a one-time manual thing to backfill some past data.
Is there a way to do this with some UDF or any other way?
I know it's easy to do this via from_json, but the nested array of structs is causing issues.
I think it depends on the values in your log column, i.e. how you want to split the string into two separate fields.
The following PySpark code will just "move" your log values into the b and c fields.
from pyspark.sql import functions as F, types as T

# Example data:
schema = (
    T.StructType([
        T.StructField('x', T.ArrayType(T.StructType([
            T.StructField('y', T.LongType()),
            T.StructField('z', T.ArrayType(T.StructType([
                T.StructField('log', T.StringType())
            ]))),
        ])))
    ])
)
df = spark.createDataFrame([
    [
        [[
            9,
            [[
                'text'
            ]]
        ]]
    ]
], schema)
df.printSchema()
# root
# |-- x: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- y: long (nullable = true)
# | | |-- z: array (nullable = true)
# | | | |-- element: struct (containsNull = true)
# | | | | |-- log: string (nullable = true)
df = df.withColumn('x', F.expr('transform(x, e -> struct(e.y as y, array(struct(struct(e.z.log[0] as b, e.z.log[0] as c) as log)) as z))'))
df.printSchema()
# root
# |-- x: array (nullable = true)
# | |-- element: struct (containsNull = false)
# | | |-- y: long (nullable = true)
# | | |-- z: array (nullable = false)
# | | | |-- element: struct (containsNull = false)
# | | | | |-- log: struct (nullable = false)
# | | | | | |-- b: string (nullable = true)
# | | | | | |-- c: string (nullable = true)
If string transformations are needed on the log column, the e.z.log[0] parts need to be changed to include them.
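Since the question mentions a preference for Scala, the same transform can be sketched there too. The split on ',' below is purely an illustrative assumption about how the b and c parts might be derived from the log string - swap in whatever string logic your data actually needs (df being the same DataFrame in a Scala session).
import org.apache.spark.sql.functions.expr

// Same higher-order transform as above, in Scala; split(...) is only a placeholder.
val result = df.withColumn("x", expr("""
  transform(x, e -> struct(
    e.y as y,
    array(struct(struct(
      split(e.z.log[0], ',')[0] as b,
      split(e.z.log[0], ',')[1] as c
    ) as log)) as z
  ))
"""))
result.printSchema()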
Higher-order functions are your friend in this case - coalesce, basically. Code below:
df.withColumn('x', F.expr('transform(x, e -> struct(e.y as y, array(struct(coalesce(("1" as a,"2" as b)) as log)) as z))')).printSchema()
root
|-- x: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- y: long (nullable = true)
| | |-- z: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- log: struct (nullable = false)
| | | | | |-- a: string (nullable = false)
| | | | | |-- b: string (nullable = false)

Exploding Nested Struct In Spark Dataframe having Different Schema

I have a JSON which has the below schema:
|-- Pool: struct (nullable = true)
| |-- 1: struct (nullable = true)
| | |-- Client: struct (nullable = true)
| | | |-- 1: struct (nullable = true)
| | | | |-- Active: boolean (nullable = true)
| | | | |-- Alias: string (nullable = true)
| | | | |-- Chaddr: string (nullable = true)
| | | |-- 2: struct (nullable = true)
| | | | |-- Active: boolean (nullable = true)
| | | | |-- Alias: string (nullable = true)
| | | | |-- Chaddr: string (nullable = true)
| |-- 2: struct (nullable = true)
| | |-- Alias: string (nullable = true)
| | |-- Chaddr: string (nullable = true)
| | |-- ChaddrMask: string (nullable = true)
| | |-- Client: struct (nullable = true)
| | | |-- 1: struct (nullable = true)
| | | | |-- Active: boolean (nullable = true)
| | | | |-- Alias: string (nullable = true)
| | | | |-- Chaddr: string (nullable = true)
And the output that I am trying to achieve is:
PoolId ClientID Client_Active
1      1        true
1      2        false
2      1        true
This schema keeps changing with the JSON. E.g. for now there are 2 Pool IDs; there may be another JSON with 5 Pool IDs, and the same goes for Client IDs.
The problem is:
We can't use explode on a struct.
Pool can't be converted to a Map, as each client has different Client IDs, which leads to a different schema for each row.
Any thoughts on how to achieve this?
I have tried this link for converting Struct to Map and then exploding, but it doesn't work when there are different numbers of Client IDs in different Pools.
From my perspective, you only need to define a UDF.
Here's an example:
Define a projection case class (what you want as a resulting structure):
case class Projection(PoolId: String, ClientID: String, Client_Active: Boolean)
Define a UDF like the one below, allowing you to work both with your structure (fields) and data:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{explode, udf}

val myUdf = udf { r: Row =>
  r.schema.fields.flatMap { rf =>
    val poolId = rf.name
    val pool = r.getAs[Row](poolId)
    val clientRow = pool.getAs[Row]("Client")
    clientRow.schema.fields.map { cr =>
      val clientId = cr.name
      val isActive = clientRow.getAs[Row](clientId).getAs[Boolean]("Active")
      Projection(poolId, clientId, isActive)
    }
  }
}
Use your UDF:
val newDF = df.select(explode(myUdf($"Pool")).as("projection"))
  .select("projection.*")
  .cache
newDF.show(false)
The output is the expected one:
+------+--------+-------------+
|PoolId|ClientID|Client_Active|
+------+--------+-------------+
|1 |1 |true |
|1 |2 |false |
|2 |1 |true |
+------+--------+-------------+

How to assign constant values to the nested objects in pyspark?

I have a requirement where I need to mask the data for some of the fields in a given schema. I've researched a lot and couldn't find the answer that is needed.
This is the schema where I need some changes on the fields (answer_type, response0, response3):
| |-- choices: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- choice_id: long (nullable = true)
| | | |-- created_time: long (nullable = true)
| | | |-- updated_time: long (nullable = true)
| | | |-- created_by: long (nullable = true)
| | | |-- updated_by: long (nullable = true)
| | | |-- answers: struct (nullable = true)
| | | | |-- answer_node_internal_id: long (nullable = true)
| | | | |-- label: string (nullable = true)
| | | | |-- text: map (nullable = true)
| | | | | |-- key: string
| | | | | |-- value: string (valueContainsNull = true)
| | | | |-- data_tag: string (nullable = true)
| | | | |-- answer_type: string (nullable = true)
| | | |-- response: struct (nullable = true)
| | | | |-- response0: string (nullable = true)
| | | | |-- response1: long (nullable = true)
| | | | |-- response2: double (nullable = true)
| | | | |-- response3: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
Is there a way I could assign values to those fields without affecting the above structure in pyspark?
I've tried using explode but I can't revert to the original schema. I don't want to create a new column, and at the same time I don't want to lose any data from the provided schema object.
Oh, I ran into a similar problem a few days ago. I suggest transforming the struct type to JSON; then with a UDF you can make the internal changes, and afterwards you can get the original struct back again.
You should look at to_json and from_json in the documentation:
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.from_json
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.to_json
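To make that concrete, here is a rough sketch of the to_json / modify / from_json round trip. It is written in Scala for brevity (the same functions exist in pyspark.sql.functions), it assumes choices is directly selectable as a column (adapt the path if it is nested deeper), and the masking regex is purely illustrative.
import org.apache.spark.sql.functions.{col, from_json, to_json, udf}

// Serialise the nested column to a JSON string.
val asJson = df.withColumn("choices_json", to_json(col("choices")))

// Hypothetical UDF that rewrites the JSON string, e.g. masking answer_type.
val maskAnswerType = udf { json: String =>
  json.replaceAll("\"answer_type\":\"[^\"]*\"", "\"answer_type\":\"MASKED\"")
}

// Parse the modified string back with the original nested schema.
val choicesType = df.schema("choices").dataType
val masked = asJson
  .withColumn("choices_json", maskAnswerType(col("choices_json")))
  .withColumn("choices", from_json(col("choices_json"), choicesType))
  .drop("choices_json")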

How can a dataframe with a list of lists be exploded so each line becomes columns - pyspark

I have a data frame as below
+--------------------+
| pas1|
+--------------------+
|[[[[H, 5, 16, 201...|
|[, 1956-09-22, AD...|
|[, 1961-03-19, AD...|
|[, 1962-02-09, AD...|
+--------------------+
I want to extract a few columns from each of the above 4 rows and create a dataframe like below. Column names should come from the schema, not hard-coded ones like column1 & column2.
+--------+-----------+
| gender | givenName |
+--------+-----------+
| a      | b         |
| a      | b         |
| a      | b         |
| a      | b         |
+--------+-----------+
pas1 - schema
root
|-- pas1: struct (nullable = true)
| |-- contactList: struct (nullable = true)
| | |-- contact: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- contactTypeCode: string (nullable = true)
| | | | |-- contactMediumTypeCode: string (nullable = true)
| | | | |-- contactTypeID: string (nullable = true)
| | | | |-- lastUpdateTimestamp: string (nullable = true)
| | | | |-- contactInformation: string (nullable = true)
| |-- dateOfBirth: string (nullable = true)
| |-- farePassengerTypeCode: string (nullable = true)
| |-- gender: string (nullable = true)
| |-- givenName: string (nullable = true)
| |-- groupDepositIndicator: string (nullable = true)
| |-- infantIndicator: string (nullable = true)
| |-- lastUpdateTimestamp: string (nullable = true)
| |-- passengerFOPList: struct (nullable = true)
| | |-- passengerFOP: struct (nullable = true)
| | | |-- fopID: string (nullable = true)
| | | |-- lastUpdateTimestamp: string (nullable = true)
| | | |-- fopFreeText: string (nullable = true)
| | | |-- fopSupplementaryInfoList: struct (nullable = true)
| | | | |-- fopSupplementaryInfo: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- type: string (nullable = true)
| | | | | | |-- value: string (nullable = true)
Thanks for the help
If you want to extract a few columns from a dataframe containing structs, you can simply do something like this:
from pyspark.sql import SparkSession,Row
spark = SparkSession.builder.appName('Test').getOrCreate()
df = spark.sparkContext.parallelize([Row(pas1=Row(gender='a', givenName='b'))]).toDF()
df.select('pas1.gender','pas1.givenName').show()
Instead, if you want to flatten your dataframe, this question should help you: How to unwrap nested Struct column into multiple columns?

How to unwind array in DataFrame (from JSON)?

Each record in an RDD contains a JSON document. I'm using SQLContext to create a DataFrame from the JSON like this:
val signalsJsonRdd = sqlContext.jsonRDD(signalsJson)
Below is the schema. datapayload is an array of items. I want to explode the array of items to get a dataframe where each row is an item from datapayload. I tried to do something based on this answer, but it seems that I would need to model the entire structure of the item in the case Row(arr: Array[...]) statement. I'm probably missing something.
val payloadDfs = signalsJsonRdd.explode($"data.datapayload") {
  case org.apache.spark.sql.Row(arr: Array[String]) => arr.map(Tuple1(_))
}
The above code throws a scala.MatchError, because the type of the actual Row is very different from Row(arr: Array[String]). There is probably a simple way to do what I want, but I can't find it. Please help.
Schema given below:
signalsJsonRdd.printSchema()
root
|-- _corrupt_record: string (nullable = true)
|-- data: struct (nullable = true)
| |-- dataid: string (nullable = true)
| |-- datapayload: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Reading: struct (nullable = true)
| | | | |-- A2DPActive: boolean (nullable = true)
| | | | |-- Accuracy: double (nullable = true)
| | | | |-- Active: boolean (nullable = true)
| | | | |-- Address: string (nullable = true)
| | | | |-- Charging: boolean (nullable = true)
| | | | |-- Connected: boolean (nullable = true)
| | | | |-- DeviceName: string (nullable = true)
| | | | |-- Guid: string (nullable = true)
| | | | |-- HandsFree: boolean (nullable = true)
| | | | |-- Header: double (nullable = true)
| | | | |-- Heading: double (nullable = true)
| | | | |-- Latitude: double (nullable = true)
| | | | |-- Longitude: double (nullable = true)
| | | | |-- PositionSource: long (nullable = true)
| | | | |-- Present: boolean (nullable = true)
| | | | |-- Radius: double (nullable = true)
| | | | |-- SSID: string (nullable = true)
| | | | |-- SSIDLength: long (nullable = true)
| | | | |-- SpeedInKmh: double (nullable = true)
| | | | |-- State: string (nullable = true)
| | | | |-- Time: string (nullable = true)
| | | | |-- Type: string (nullable = true)
| | | |-- Time: string (nullable = true)
| | | |-- Type: string (nullable = true)
tl;dr explode function is your friend (or my favorite flatMap).
explode function creates a new row for each element in the given array or map column.
Something like the following should work:
signalsJsonRdd.withColumn("element", explode($"data.datapayload"))
See functions object.
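Once the array is exploded, the item's fields can be pulled out with ordinary column paths. A small follow-up sketch (column names taken from the schema above; the item alias is just an illustrative name):
import org.apache.spark.sql.functions.explode
import sqlContext.implicits._

val items = signalsJsonRdd
  .withColumn("item", explode($"data.datapayload"))
  .select($"data.dataid", $"item.Type", $"item.Time", $"item.Reading.Latitude")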