Update Schema for DataFrame in Apache Spark - scala

I have a DataFrame with the following schema
root
|-- col_a: string (nullable = false)
|-- col_b: string (nullable = false)
|-- col_c_a: string (nullable = false)
|-- col_c_b: string (nullable = false)
|-- col_d: string (nullable = false)
|-- col_e: string (nullable = false)
|-- col_f: string (nullable = false)
Now I want to convert the schema of this DataFrame to something like this:
root
|-- col_a: string (nullable = false)
|-- col_b: string (nullable = false)
|-- col_c: struct (nullable = false)
| |-- col_c_a: string (nullable = false)
| |-- col_c_b: string (nullable = false)
|-- col_d: string (nullable = false)
|-- col_e: string (nullable = false)
|-- col_f: string (nullable = false)
I am able to do this with a map transformation by explicitly fetching the value of each column from the Row type, but that is a very complex process and does not look good. So,
is there any way I can achieve this?
Thanks

There is a built-in struct function with the definition:
def struct(cols: Column*): Column
You can use it like this:
df.show
+---+---+
| a| b|
+---+---+
| 1| 2|
| 2| 3|
+---+---+
df.withColumn("struct_col", struct($"a", $"b")).show
+---+---+----------+
| a| b|struct_col|
+---+---+----------+
| 1| 2| [1,2]|
| 2| 3| [2,3]|
+---+---+----------+
The schema of the new DataFrame is:
|-- a: integer (nullable = false)
|-- b: integer (nullable = false)
|-- struct_col: struct (nullable = false)
| |-- a: integer (nullable = false)
| |-- b: integer (nullable = false)
In your case, you can do something like:
df.withColumn("col_c", struct($"col_c_a", $"col_c_b")).drop($"col_c_a").drop($"col_c_b")

Related

Selected Values of a JSON key Fetch to DataFrame in Spark scala

The structure of the JSON looks like below:
|-- destination: struct (nullable = true)
| |-- activity: string (nullable = true)
| |-- id: string (nullable = true)
| |-- destination_class: array (nullable = true)
|-- Health: struct (nullable = true)
| |-- sample: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = true)
|-- Marks: struct (nullable = true)
| |-- exam_score: double (nullable = true)
|-- sourceID: string (nullable = true)
|-- unique_exam_fields: struct (nullable = true)
| |-- indOrigin: string (nullable = true)
| |-- compo: string (nullable = true)
How can I select only a few fields from each object?
I am trying to bring the below fields into a DataFrame:
from destination: id and activity
from Health: id and name
from Marks: exam_score
Code I tried:
val DF = spark.read.json("D:/data.json")
but the above code brings all the fields.
Expected output: the DataFrame should look like
destination_id|activity|Health_id|Name|Exam_score
Please help
You can use the dot notation to access the nested structures and then give the columns an alias:
import org.apache.spark.sql.functions.col
df.select(col("destination.id").as("destination_id"),
    col("destination.activity").as("activity"),
    col("Health.sample.id").as("Health_id"),
    col("Health.sample.name").as("Name"),
    col("Marks.exam_score").as("Exam_score"))
  .show()
prints
+--------------+--------+---------+----+----------+
|destination_id|activity|Health_id|Name|Exam_score|
+--------------+--------+---------+----+----------+
| b| a| c| d| e|
| b1| a1| c1| d1| e1|
+--------------+--------+---------+----+----------+
Option 1: Load the complete file and select the required columns as below.
Add all the required columns inside a Seq and then use those columns inside selectExpr:
val columns = Seq(
"destination.id as destination_id",
"destination.activity as activity",
"Health.sample.id as health_id",
"Health.sample.name as name",
"Marks.exam_score as exam_score"
)
df.selectExpr(columns:_*)
Option 2: Create a StructType with the required columns and apply the schema before loading the file data.
val schema = // Your required columns in schema
val DF = spark.read.schema(schema).json("D:/data.json")
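For illustration, the schema placeholder above could be filled in like this (a sketch that keeps only the nested fields selected in this question; the real file may contain more):
import org.apache.spark.sql.types._
val schema = StructType(Seq(
  StructField("destination", StructType(Seq(
    StructField("id", StringType),
    StructField("activity", StringType)
  ))),
  StructField("Health", StructType(Seq(
    StructField("sample", StructType(Seq(
      StructField("id", StringType),
      StructField("name", StringType)
    )))
  ))),
  StructField("Marks", StructType(Seq(
    StructField("exam_score", DoubleType)
  )))
))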

convert pyspark dataframe value to customized schema

I am receiving streaming data from Kafka. By default, dataframe.value is of "string" type. For example, dataframe.value is:
1.0,2.0,4,'a'
1.1,2.1,3,'a1'
The schema of dataframe.value:
root
|-- value: string (nullable = true)
Now I want to define a schema on this DataFrame. The schema I want as output:
root
|-- c1: double (nullable = true)
|-- c2: double (nullable = true)
|-- c3: integer (nullable = true)
|-- c4: string (nullable = true)
When I define the schema and then load the data from Kafka, I get the error "Kafka has already defined schema can not apply the customized one".
Any help on this issue will be highly appreciated.
You can define the schema when you convert the RDD to a DataFrame.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
kafkaRdd = sc.parallelize([(1.0, 2.0, 4, 'a'), (1.1, 2.1, 3, 'a1')])
schema = StructType([
    StructField("c1", DoubleType(), True),
    StructField("c2", DoubleType(), True),
    StructField("c3", IntegerType(), True),
    StructField("c4", StringType(), True)
])
df = kafkaRdd.toDF(schema)
df.show()
df.printSchema()
Here is the output:
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|1.0|2.0| 4| a|
|1.1|2.1| 3| a1|
+---+---+---+---+
root
|-- c1: double (nullable = true)
|-- c2: double (nullable = true)
|-- c3: integer (nullable = true)
|-- c4: string (nullable = true)

Spark/Scala: join dataframes when id is nested in an array of structs

I'm using Spark MLlib's DataFrame ALS functionality on Spark 2.2.0. I had to run my userId and itemId fields through a StringIndexer to get things going.
The method recommendForAllUsers returns the following schema:
root
|-- recommendations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemIdIndex: long (nullable = true)
| | |-- rating: double (nullable = true)
|-- userIdIndex: string (nullable = true)
This is perfect for my needs (I would love not to flatten it), but I need to replace userIdIndex and itemIdIndex with their actual values.
For userIdIndex this was OK (I couldn't simply reverse it with IndexToString, as the ALS fitting seems to erase the link between index and value):
df.join(df2, df2("userIdIndex")===df("userIdIndex"), "left")
.select(df2("userId"), df("recommendations"))
where df2 looks like this:
+------------------+--------------------+----------+-----------+-----------+
| userId| itemId| rating|userIdIndex|itemIdIndex|
+------------------+--------------------+----------+-----------+-----------+
|glorified-consumer| item-22302| 3.0| 15.0| 4.0|
the result is this schema:
root
|-- userId: string (nullable = true)
|-- recommendations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemIdIndex: integer (nullable = true)
| | |-- rating: float (nullable = true)
QUESTION: how do I do the same for itemIdIndex, which is nested inside an array of structs?
You can explode the array so that only the struct remains:
val tempdf2 = df2.withColumn("recommendations", explode('recommendations))
which should leave you with a schema like:
root
|-- userId: string (nullable = true)
|-- recommendations: struct (nullable = true)
| |-- itemIdIndex: string (nullable = true)
| |-- rating: string (nullable = true)
Do the same for df (the first DataFrame).
Then you can join them as:
tempdf1.join(tempdf2, tempdf1("recommendations.itemIdIndex") === tempdf2("recommendations.itemIdIndex"))
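Putting it together, a minimal sketch (assuming df is the recommendForAllUsers output, df2 is the indexed ratings table shown above, and spark.implicits._ is in scope):
import org.apache.spark.sql.functions.explode
// one row per (user, recommended item)
val exploded = df
  .withColumn("rec", explode($"recommendations"))
  .select($"userIdIndex", $"rec.itemIdIndex", $"rec.rating")
// recover the original itemId values from the lookup table
val resolved = exploded
  .join(df2.select($"itemIdIndex", $"itemId").distinct(), Seq("itemIdIndex"), "left")
  .select($"userIdIndex", $"itemId", $"rating")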

Pivot spark multilevel Dataset

I have a Dataset in Spark with this schema:
root
|-- from: struct (nullable = false)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- tags: string (nullable = true)
|-- v1: struct (nullable = false)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- tags: string (nullable = true)
|-- v2: struct (nullable = false)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- tags: string (nullable = true)
|-- v3: struct (nullable = false)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- tags: string (nullable = true)
|-- to: struct (nullable = false)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- tags: string (nullable = true)
How can I make a table (with only the 3 columns id, name, tags) from this Dataset in Scala?
Just combine all the columns into an array, explode and select all nested fields:
import org.apache.spark.sql.functions.{array, col, explode}
case class Vertex(id: String, name: String, tags: String)
val df = Seq((
Vertex("1", "from", "a"), Vertex("2", "V1", "b"), Vertex("3", "V2", "c"),
Vertex("4", "v3", "d"), Vertex("5", "to", "e")
)).toDF("from", "v1", "v2", "v3", "to")
df.select(explode(array(df.columns map col: _*)).alias("col")).select("col.*")
with the result as follows:
+---+----+----+
| id|name|tags|
+---+----+----+
| 1|from| a|
| 2| V1| b|
| 3| V2| c|
| 4| v3| d|
| 5| to| e|
+---+----+----+
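The same call applied to the Dataset from the question (a sketch, assuming it is named ds and the same imports are in scope) keeps just the three nested fields:
ds.select(explode(array(ds.columns map col: _*)).alias("col"))
  .select("col.id", "col.name", "col.tags")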

How to extract all individual elements from a nested WrappedArray from a DataFrame in Spark

How can I get all individual elements from MEMBERDETAIL?
scala> xmlDF.printSchema
root
|-- MEMBERDETAIL: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- FILE_ID: double (nullable = true)
| | |-- INP_SOURCE_ID: long (nullable = true)
| | |-- NET_DB_CR_SW: string (nullable = true)
| | |-- NET_PYM_AMT: string (nullable = true)
| | |-- ORGNTD_DB_CR_SW: string (nullable = true)
| | |-- ORGNTD_PYM_AMT: double (nullable = true)
| | |-- RCVD_DB_CR_SW: string (nullable = true)
| | |-- RCVD_PYM_AMT: string (nullable = true)
| | |-- RECON_DATE: string (nullable = true)
| | |-- SLNO: long (nullable = true)
scala> xmlDF.head
res147: org.apache.spark.sql.Row = [WrappedArray([1.1610100000001425E22,1,D, 94,842.38,C,0.0,D, 94,842.38,2016-10-10,1], [1.1610100000001425E22,1,D, 33,169.84,C,0.0,D, 33,169.84,2016-10-10,2], [1.1610110000001425E22,1,D, 155,500.88,C,0.0,D, 155,500.88,2016-10-11,3], [1.1610110000001425E22,1,D, 164,952.29,C,0.0,D, 164,952.29,2016-10-11,4], [1.1610110000001425E22,1,D, 203,061.06,C,0.0,D, 203,061.06,2016-10-11,5], [1.1610110000001425E22,1,D, 104,040.01,C,0.0,D, 104,040.01,2016-10-11,6], [2.1610110000001427E22,1,C, 849.14,C,849.14,C, 0.00,2016-10-11,7], [1.1610100000001465E22,1,D, 3.78,C,0.0,D, 3.78,2016-10-10,1], [1.1610100000001465E22,1,D, 261.54,C,0.0,D, ...
After trying many approaches, I am able to get just an "Any" object, as below, but am still not able to read the individual fields separately.
xmlDF.select($"MEMBERDETAIL".getItem(0)).head().get(0)
res56: Any = [1.1610100000001425E22,1,D,94,842.38,C,0.0,D,94,842.38,2016-10-10,1]
And the StructType is as below:
res61: org.apache.spark.sql.DataFrame = [MEMBERDETAIL[0]: struct<FILE_ID:double,INP_SOURCE_ID:bigint,NET_DB_CR_SW:string,NET_PYM_AMT:string,ORGNTD_DB_CR_SW:string,ORGNTD_PYM_AMT:double,RCVD_DB_CR_SW:string,RCVD_PYM_AMT:string,RECON_DATE:string,SLNO:bigint>]
This actually helped me -
xmlDF.selectExpr("explode(MEMBERDETAIL) as e").select("e.FILE_ID", "e.INP_SOURCE_ID", "e.NET_DB_CR_SW", "e.NET_PYM_AMT", "e.ORGNTD_DB_CR_SW", "e.ORGNTD_PYM_AMT", "e.RCVD_DB_CR_SW", "e.RCVD_PYM_AMT", "e.RECON_DATE").show()