Spark dataframe with complex & nested data - scala

I currently have 3 dataframes.
Call them dfA, dfB, and dfC.
dfA has 3 cols:
|Id | Name | Age |
dfB has, say, 5 cols; the 2nd col is an FK reference back to a dfA record:
|Id | AId | Street | City | Zip |
Similarly, dfC has 3 cols, also with a reference back to dfA:
|Id | AId | SomeField |
Using Spark SQL I can do a JOIN across the 3:
%sql
SELECT * FROM dfA
INNER JOIN dfB ON dfA.Id = dfB.AId
INNER JOIN dfC ON dfA.Id = dfC.AId
And I'll get my result set, but it is "flattened", as SQL does with tabular results.
I want to load it into a complex schema like this:
import org.apache.spark.sql.types._

val destinationSchema = new StructType()
  .add("id", IntegerType)
  .add("name", StringType)
  .add("age", StringType)
  .add("b",
    new StructType()
      .add("street", StringType, true)
      .add("city", StringType, true)
      .add("zip", StringType, true)
  )
  .add("c",
    new StructType()
      .add("somefield", StringType, true)
  )
Any ideas how to take the results of the SELECT and save them to a DataFrame with the schema specified?
I ultimately want to save the complex StructType, or JSON, and load it into MongoDB using the Mongo Spark Connector.
Or, is there a better way to accomplish this from the 3 separate dataframes (which were originally 3 separate CSV files that were read in)?

Given three dataframes loaded in from CSV files, you can do this:
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $ column syntax (pre-imported in notebooks)

val destDF = atableDF
  .join(btableDF, atableDF("id") === btableDF("id")).drop(btableDF("id"))
  .join(ctableDF, atableDF("id") === ctableDF("id")).drop(ctableDF("id"))
  .select($"id", $"name", $"age", struct($"street", $"city", $"zip") as "b", struct($"somefield") as "c")

val jsonDestDF = destDF.select(to_json(struct($"*")).as("row"))
which will output:
row
{"id":100,"name":"John","age":"43","b":{"street":"Dark Road","city":"Washington","zip":"98002"},"c":{"somefield":"appples"}}
{"id":101,"name":"Sally","age":"34","b":{"street":"Light Ave","city":"Los Angeles","zip":"90210"},"c":{"somefield":"bananas"}}
{"id":102,"name":"Damian","age":"23","b":{"street":"Short Street","city":"New York","zip":"70701"},"c":{"somefield":"pears"}}

The previous one works if all the records have a 1:1 relationship.
Here is how you can achieve it for 1:M (hint: use collect_set to group the child rows):
import org.apache.spark.sql.functions._

val destDF = atableDF
  .join(btableDF, atableDF("id") === btableDF("id")).drop(btableDF("id"))
  .join(ctableDF, atableDF("id") === ctableDF("id")).drop(ctableDF("id"))
  .groupBy($"id", $"name", $"age")
  .agg(collect_set(struct($"street", $"city", $"zip")) as "b", collect_set(struct($"somefield")) as "c")

val jsonDestDF = destDF.select(to_json(struct($"*")).as("row"))

display(jsonDestDF)
which will give you the following output:
row
{"id":102,"name":"Damian","age":"23","b":[{"street":"Short Street","city":"New York","zip":"70701"}],"c":[{"somefield":"pears"},{"somefield":"pineapples"}]}
{"id":100,"name":"John","age":"43","b":[{"street":"Dark Road","city":"Washington","zip":"98002"}],"c":[{"somefield":"appples"}]}
{"id":101,"name":"Sally","age":"34","b":[{"street":"Light Ave","city":"Los Angeles","zip":"90210"}],"c":[{"somefield":"grapes"},{"somefield":"peaches"},{"somefield":"bananas"}]}
Sample data I used, just in case anyone wants to play:
atable.csv
100,"John",43
101,"Sally",34
102,"Damian",23
104,"Rita",14
105,"Mohit",23
btable.csv:
100,"Dark Road","Washington",98002
101,"Light Ave","Los Angeles",90210
102,"Short Street","New York",70701
104,"Long Drive","Buffalo",80345
105,"Circular Quay","Orlando",65403
ctable.csv:
100,"appples"
101,"bananas"
102,"pears"
101,"grapes"
102,"pineapples"
101,"peaches"

Related

What is the order guarantee when joining two columns of a Spark dataframe which are processed separately?

I have a dataframe with 3 columns:
date
jsonString1
jsonString2
I want to expand the attributes inside the JSON strings into columns, so I did something like this:
val json1 = spark.read.json(dataframe.select(col("jsonString1")).rdd.map(_.getString(0)))
val json2 = spark.read.json(dataframe.select(col("jsonString2")).rdd.map(_.getString(0)))
val json1Table = json1.selectExpr("id", "status")
val json2Table = json2.selectExpr("name", "address")
Now I want to put these tables together, so I did the following:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val json1TableWithIndex = addColumnIndex(json1Table)
val json2TableWithIndex = addColumnIndex(json2Table)

val finalResult = json1TableWithIndex
  .join(json2TableWithIndex, Seq("columnindex"))
  .drop("columnindex")

def addColumnIndex(df: DataFrame) = spark.createDataFrame(
  df.rdd.zipWithIndex.map { case (row, columnindex) => Row.fromSeq(row.toSeq :+ columnindex) },
  StructType(df.schema.fields :+ StructField("columnindex", LongType, false))
)
After sampling a few rows I observe that the rows match exactly as in the source dataframe.
I did not find any information on the order guarantee when joining two columns of a dataframe which are processed separately. Is this the right way to solve my problem? Any help is appreciated.
It is always risky to rely on undocumented behaviour, as your code might not work as you intended, because you only have a partial understanding of it.
You can do the same thing in a much more efficient way without any split-and-join approach: use the from_json function to create two nested columns, then flatten out the nested columns, and finally drop the intermediate JSON string columns and the nested columns.
Here is an example of the whole process.
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType, StructField}
val df = Seq(
  ("09-02-2020", "{\"id\":\"01\", \"status\":\"Active\"}", "{\"name\":\"Abdullah\", \"address\":\"Jumeirah\"}"),
  ("10-02-2020", "{\"id\":\"02\", \"status\":\"Dormant\"}", "{\"name\":\"Ali\", \"address\":\"Jebel Ali\"}")
).toDF("date", "jsonString1", "jsonString2")
scala> df.show()
+----------+--------------------+--------------------+
| date| jsonString1| jsonString2|
+----------+--------------------+--------------------+
|09-02-2020|{"id":"01", "stat...|{"name":"Abdullah...|
|10-02-2020|{"id":"02", "stat...|{"name":"Ali", "a...|
+----------+--------------------+--------------------+
val schema1 = StructType(Seq(
  StructField("id", StringType, true),
  StructField("status", StringType, true)
))

val schema2 = StructType(Seq(
  StructField("name", StringType, true),
  StructField("address", StringType, true)
))
val dfFlattened = df
  .withColumn("jsonData1", from_json(col("jsonString1"), schema1))
  .withColumn("jsonData2", from_json(col("jsonString2"), schema2))
  .withColumn("id", col("jsonData1.id"))
  .withColumn("status", col("jsonData1.status"))
  .withColumn("name", col("jsonData2.name"))
  .withColumn("address", col("jsonData2.address"))
  .drop("jsonString1")
  .drop("jsonString2")
  .drop("jsonData1")
  .drop("jsonData2")
scala> dfFlattened.show()
+----------+---+-------+--------+---------+
| date| id| status| name| address|
+----------+---+-------+--------+---------+
|09-02-2020| 01| Active|Abdullah| Jumeirah|
|10-02-2020| 02|Dormant| Ali|Jebel Ali|
+----------+---+-------+--------+---------+
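The same flattening can also be written more compactly by expanding the nested columns with a wildcard instead of one withColumn call per field (same result, just shorter):

val dfFlattened2 = df
  .withColumn("jsonData1", from_json(col("jsonString1"), schema1))
  .withColumn("jsonData2", from_json(col("jsonString2"), schema2))
  .select(col("date"), col("jsonData1.*"), col("jsonData2.*"))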

Transform structured data to JSON format using Spark Scala

I've my "Structured data" as shown below, I need to transform it to the below shown "Expected results" type. My "Output schema" is shown as well. Appreciate if you can provide some help on how I can achieve this using Spark Scala code.
Note: Grouping on the structured data to be done the columns SN and VIN.
There should be one row for the same SN and VIN, if either SN or VIN changes, then data to be present in the next row.
Structured data:
+-----------------+-------------+-----+---+
|VIN              |ST           |SV   |SN |
+-----------------+-------------+-----+---+
|FU74HZ501740XXXXX|1566799999225|44.0 |APP|
|FU74HZ501740XXXXX|1566800002758|61.0 |APP|
|FU74HZ501740XXXXX|1566800009446|23.39|ASP|
+-----------------+-------------+-----+---+
Expected results: one row per VIN and SN combination, with the matching ST/SV/SN values collected into an EVENTS array, as described by the output schema below.
Output schema:
import org.apache.spark.sql.types._

val outputSchema = StructType(
  List(
    StructField("VIN", StringType, true),
    StructField("EVENTS", ArrayType(
      StructType(Array(
        StructField("SN", StringType, true),
        StructField("ST", LongType, true), // epoch-millis values such as 1566799999225 do not fit in an IntegerType
        StructField("SV", DoubleType, true)
      ))))
  )
)
From Spark 2.1 you can achieve this using struct and collect_list.
val df_2 = Seq(
  ("FU74HZ501740XXXX", 1566799999225.0, 44.0, "APP"),
  ("FU74HZ501740XXXX", 1566800002758.0, 61.0, "APP"),
  ("FU74HZ501740XXXX", 1566800009446.0, 23.39, "ASP")
).toDF("vin", "st", "sv", "sn")
df_2.show(false)
+----------------+-----------------+-----+---+
|vin |st |sv |sn |
+----------------+-----------------+-----+---+
|FU74HZ501740XXXX|1.566799999225E12|44.0 |APP|
|FU74HZ501740XXXX|1.566800002758E12|61.0 |APP|
|FU74HZ501740XXXX|1.566800009446E12|23.39|ASP|
+----------------+-----------------+-----+---+
Use collect_list with struct:
df_2.groupBy("vin","sn")
.agg(collect_list(struct($"st", $"sv",$"sn")).as("events"))
.withColumn("events",to_json($"events"))
.drop(col("sn"))
This will give the wanted output:
+----------------+---------------------------------------------------------------------------------------------+
|vin |events |
+----------------+---------------------------------------------------------------------------------------------+
|FU74HZ501740XXXX|[{"st":1.566800009446E12,"sv":23.39,"sn":"ASP"}] |
|FU74HZ501740XXXX|[{"st":1.566799999225E12,"sv":44.0,"sn":"APP"},{"st":1.566800002758E12,"sv":61.0,"sn":"APP"}]|
+----------------+---------------------------------------------------------------------------------------------+
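If you want events to stay as an array of structs (matching the outputSchema above) rather than a JSON string, just skip the to_json step:

val nestedDF = df_2.groupBy("vin", "sn")
  .agg(collect_list(struct($"st", $"sv", $"sn")).as("events"))
  .drop("sn")

nestedDF.printSchema()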
You can also read JSON directly via the SparkSession:
val df = spark.read.json("/path/to/json/file/test.json")
Here spark is the SparkSession object.

creating dataframe by loading csv file using scala in spark

but the CSV file has extra double quotes added, which results in all the columns being read into a single column.
There are four columns, a header, and 2 rows:
"""SlNo"",""Name"",""Age"",""contact"""
"1,""Priya"",78,""Phone"""
"2,""Jhon"",20,""mail"""
val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ",")
  .option("inferSchema", "true")
  .load("bank.csv")

df: org.apache.spark.sql.DataFrame = ["SlNo","Name","Age","contact": string]
What you can do is read the file using sparkContext, replace all " with empty strings, and use zipWithIndex() to separate the header from the data so that a custom schema and a row RDD can be created. Finally, just pass the row RDD and the schema to sqlContext's createDataFrame API.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

//reading the text file, replacing and splitting, and finally zipping with index
val rdd = sc.textFile("bank.csv").map(_.replaceAll("\"", "").split(",")).zipWithIndex()

//separating the header to form the schema
val header = rdd.filter(_._2 == 0).flatMap(_._1).collect()
val schema = StructType(header.map(StructField(_, StringType, true)))

//separating the data to form the row rdd
val rddData = rdd.filter(_._2 > 0).map(x => Row.fromSeq(x._1))

//creating the dataframe
sqlContext.createDataFrame(rddData, schema).show(false)
You should be getting
+----+-----+---+-------+
|SlNo|Name |Age|contact|
+----+-----+---+-------+
|1 |Priya|78 |Phone |
|2 |Jhon |20 |mail |
+----+-----+---+-------+
I hope the answer is helpful
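As an alternative, if you are on Spark 2.2+ (where the CSV reader accepts a Dataset[String]), you can strip the quotes first and still let the built-in reader handle the header and type inference; a sketch:

import spark.implicits._

// strip the stray quotes, then parse the cleaned lines as CSV
val cleanedLines = spark.read.textFile("bank.csv").map(_.replaceAll("\"", ""))

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(cleanedLines)

df.show(false)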

Join two text file with one column different in their schema in spark scala

I have two text files and I am creating dataframes out of them. Both files have the same number of columns, except for one column.
When I create the schemas and join the two, I get an error like:
java.lang.ArrayIndexOutOfBoundsException
Basically my schema has 6 columns, while one of my text files has only 5 columns.
Now, how do I append a null value to the already created schema and then do the join?
Here is my code:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Array(
  StructField("TimeStamp", StringType),
  StructField("Id", StringType),
  StructField("Name", StringType),
  StructField("Val", StringType),
  StructField("Age", StringType),
  StructField("Dept", StringType)))
val textRdd1 = sc.textFile("s3://test/Text1.txt")
val rowRdd1 = textRdd1.map(line => Row.fromSeq(line.split(",", -1)))
var df1 = sqlContext.createDataFrame(rowRdd1, schema)
val textRdd2 = sc.textFile("s3://test/Text2.txt")
val rowRdd2 = textRdd2.map(line => Row.fromSeq(line.split(",", -1)))
var df2 = sqlContext.createDataFrame(rowRdd2, schema)
val df3 = df1.join(df2)
The TimeStamp column is not present in the first text file ...
Why don't you just exclude the TimeStamp field from the schema for the first DataFrame?
val df1 = sqlContext.createDataFrame(rowRdd1, new StructType(schema.tail.toArray))
As mentioned in the comments, the schemas need not be identical. You can also specify your join condition and select the columns to join on.
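If the missing field is not necessarily the first one in the schema, filtering it out by name is slightly more robust than tail; a small sketch:
val df1 = sqlContext.createDataFrame(rowRdd1, StructType(schema.filterNot(_.name == "TimeStamp")))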
Add the TimeStamp column to the 1st dataframe:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.LongType

val df1Final = df1.withColumn("TimeStamp", lit(null).cast(LongType))
Then proceed with the join.
You can create a new schema without this field and use that schema. What Dmitri was suggesting is to use the original schema and remove the field that you don't need, to save you writing a second schema definition.
Once you have the 2 files loaded into DataFrames, you perform the JOIN based on the common fields and remove the duplicate columns (which I guess is what you want), like this:
val df3 = df1.join(df2,
    (df1("Id") === df2("Id")) &&
    (df1("Name") === df2("Name")) &&
    (df1("Val") === df2("Val")) &&
    (df1("Age") === df2("Age")) &&
    (df1("Dept") === df2("Dept")))
  .drop(df2("Id"))
  .drop(df2("Name"))
  .drop(df2("Val"))
  .drop(df2("Age"))
  .drop(df2("Dept"))

sqlContext.createDataframe from Row with Schema. pyspark: TypeError: IntegerType can not accept object in type <type 'unicode'>

After spending way too much time figuring out why I get the following error:
pyspark: TypeError: IntegerType can not accept object in type <type 'unicode'>
while trying to create a dataframe based on Rows and a schema, I noticed the following.
With a Row inside my RDD called rddRows looking as follows:
Row(a="1", b="2", c=3)
and my dfSchema defined as:
dfSchema = StructType([
    StructField("c", IntegerType(), True),
    StructField("a", StringType(), True),
    StructField("b", StringType(), True)
])
creating a dataframe as follows:
df = sqlContext.createDataFrame(rddRows, dfSchema)
brings the above-mentioned error, because Spark only considers the order of the StructFields in the schema and does not match the names of the StructFields to the names of the Row fields.
In other words, in the above example, I noticed that Spark tries to create a dataframe that would look as follows (if there were no TypeError, e.g. if everything were of type String):
+---+---+---+
| c | b | a |
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
Is this really expected, or is it some sort of bug?
EDIT: the rddRows are created along these lines:
def createRows(dic):
    res = Row(a=dic["a"], b=dic["b"], c=int(dic["c"]))
    return res

rddRows = rddDict.map(createRows)
where rddDict is a parsed JSON file.
The constructor of the Row sorts the keys if you provide keyword arguments. Take a look at the source code here. When I found out about that, I ended up sorting my schema accordingly before applying it to the dataframe:
sorted_fields = sorted(dfSchema.fields, key=lambda x: x.name)
sorted_schema = StructType(fields=sorted_fields)
df = sqlContext.createDataFrame(rddRows, sorted_schema)