Extract Array of Struct from Parquet into multi-value CSV - Spark Scala

Using Spark Scala, I am trying to extract an array of structs from parquet. The input is a parquet file. The output is a CSV file. Each field of the CSV can hold "multi-values" delimited by "#;". The CSV itself is delimited by ",". What is the best way to accomplish this?
Schema
root
|-- llamaEvent: struct (nullable = true)
| |-- activity: struct (nullable = true)
| | |-- Animal: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- time: string (nullable = true)
| | | | |-- status: string (nullable = true)
| | | | |-- llamaType: string (nullable = true)
Example Input as json (the input will be parquet)
{
  "llamaEvent": {
    "activity": {
      "Animal": [
        {
          "time": "5-1-2020",
          "status": "Running",
          "llamaType": "red llama"
        },
        {
          "time": "6-2-2020",
          "status": "Sitting",
          "llamaType": "blue llama"
        }
      ]
    }
  }
}
Desired CSV Output
time,status,llamaType
5-1-2020#;6-2-2020,Running#;Sitting,red llama#;blue llama
Update:
Based on some trial and error, I believe a solution like the following may be appropriate, depending on the use case. It takes a shortcut by grabbing the array item, casting it to string, and then parsing out the extraneous characters, which is good enough for some use cases.
df.select(col("llamaEvent.activity").getItem("Animal").getItem("time").cast("String"))
Then you can perform whatever parsing you want afterwards, such as regexp_replace:
df.withColumn("time", regexp_replace(col("time"), ",", "#;"))
Several appropriate solutions were also proposed using groupby, explode, aggregate as well.
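The string-cast shortcut amounts to ordinary string cleanup. A minimal sketch of the idea outside Spark (plain Python, hypothetical helper name `to_multi_value`), assuming the cast produces the usual bracketed, comma-separated rendering of an array:

```python
import re

# Sketch of the "cast array to string, then parse out extraneous characters"
# shortcut. Casting an array column to string yields something like
# "[5-1-2020, 6-2-2020]"; the cleanup strips the brackets and replaces the
# ", " separator with the "#;" multi-value delimiter.
def to_multi_value(casted_array: str) -> str:
    inner = casted_array.strip("[]")
    return "#;".join(re.split(r",\s*", inner))

print(to_multi_value("[5-1-2020, 6-2-2020]"))  # 5-1-2020#;6-2-2020
```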

One approach would be to flatten the array of animal attribute structs using the SQL function inline, aggregate the attributes via collect_list, and concatenate each list with the specified delimiter.
Given a DataFrame df with your provided schema, the following transformations will generate the desired dataset, dfResult:
val attribCSVs = List("time", "status", "llamaType").map(
  c => concat_ws("#;", collect_list(c)).as(c)
)
val dfResult = df.
  select($"eventId", expr("inline(llamaEvent.activity.Animal)")).
  groupBy("eventId").agg(attribCSVs.head, attribCSVs.tail: _*)
Note that an event identifying column eventId is added to the sample json data for the necessary groupBy aggregation.
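To make the aggregation concrete: per eventId, inline flattens the structs into rows, collect_list gathers each attribute's values, and concat_ws joins them with "#;". A rough plain-Python sketch of what that computes (not Spark code, sample data only):

```python
# Plain-Python model of inline + groupBy + collect_list + concat_ws("#;"):
# flatten one event's Animal structs, then join each attribute's values.
events = {
    1: [{"time": "5-1-2020", "status": "Running", "llamaType": "red llama"},
        {"time": "6-2-2020", "status": "Sitting", "llamaType": "blue llama"}],
}

def aggregate(animals, attrs=("time", "status", "llamaType")):
    return {a: "#;".join(row[a] for row in animals) for a in attrs}

print(aggregate(events[1]))
# {'time': '5-1-2020#;6-2-2020', 'status': 'Running#;Sitting', 'llamaType': 'red llama#;blue llama'}
```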
Let's assemble some sample data:
val jsons = Seq(
  """{
    "eventId": 1,
    "llamaEvent": {
      "activity": {
        "Animal": [
          { "time": "5-1-2020", "status": "Running", "llamaType": "red llama" },
          { "time": "6-2-2020", "status": "Sitting", "llamaType": "blue llama" }
        ]
      }
    }
  }""",
  """{
    "eventId": 2,
    "llamaEvent": {
      "activity": {
        "Animal": [
          { "time": "5-2-2020", "status": "Running", "llamaType": "red llama" },
          { "time": "6-3-2020", "status": "Standing", "llamaType": "blue llama" }
        ]
      }
    }
  }"""
)
val df = spark.read.option("multiLine", true).json(jsons.toDS)
df.show(false)
+-------+----------------------------------------------------------------------+
|eventId|llamaEvent |
+-------+----------------------------------------------------------------------+
|1 |{{[{red llama, Running, 5-1-2020}, {blue llama, Sitting, 6-2-2020}]}} |
|2 |{{[{red llama, Running, 5-2-2020}, {blue llama, Standing, 6-3-2020}]}}|
+-------+----------------------------------------------------------------------+
Applying the above transformations, dfResult should look like below:
dfResult.show(false)
+-------+------------------+-----------------+---------------------+
|eventId|time |status |llamaType |
+-------+------------------+-----------------+---------------------+
|1 |5-1-2020#;6-2-2020|Running#;Sitting |red llama#;blue llama|
|2 |5-2-2020#;6-3-2020|Running#;Standing|red llama#;blue llama|
+-------+------------------+-----------------+---------------------+
Writing dfResult to a CSV file:
dfResult.write.option("header", true).csv("/path/to/csv")
/*
eventId,time,status,llamaType
1,5-1-2020#;6-2-2020,Running#;Sitting,red llama#;blue llama
2,5-2-2020#;6-3-2020,Running#;Standing,red llama#;blue llama
*/

Here is a working PySpark solution:
import pyspark.sql.functions as F
import pyspark.sql.types as T

# a_json is assumed to hold the input JSON as a Python dict
df = spark.createDataFrame([(str([a_json]))], T.StringType())
df = df.withColumn('col', F.from_json("value", T.ArrayType(T.StringType())))
df = df.withColumn("col", F.explode("col"))
df = df.withColumn("col", F.from_json("col", T.MapType(T.StringType(), T.StringType())))
df = df.withColumn("llamaEvent", df.col.getItem("llamaEvent"))
df = df.withColumn("llamaEvent", F.from_json("llamaEvent", T.MapType(T.StringType(), T.StringType())))
df = df.select("*", F.explode("llamaEvent").alias("x", "y"))
df = df.withColumn("Activity", F.from_json("y", T.MapType(T.StringType(), T.StringType())))
df = df.select("*", F.explode("Activity").alias("x", "yy"))
df = df.withColumn("final_col", F.from_json("yy", T.ArrayType(T.StringType())))
df = df.withColumn("final_col", F.explode("final_col"))
df = df.withColumn("final_col", F.from_json("final_col", T.MapType(T.StringType(), T.StringType())))
df = (df
      .withColumn("time", df.final_col.getItem("time"))
      .withColumn("status", df.final_col.getItem("status"))
      .withColumn("llamaType", df.final_col.getItem("llamaType"))
      .withColumn("agg_col", F.lit("1")))
df_grp = df.groupby("agg_col").agg(
    F.concat_ws("#;", F.collect_list(df.time)).alias("time"),
    F.concat_ws("#;", F.collect_list(df.status)).alias("status"),
    F.concat_ws("#;", F.collect_list(df.llamaType)).alias("llamaType"))
display(df)
+--------------------+--------------------+--------------------+--------+--------------------+--------------------+------+--------------------+--------------------+--------+-------+----------+-------+
| value| col| llamaEvent| x| y| Activity| x| yy| final_col| time| status| llamaType|agg_col|
+--------------------+--------------------+--------------------+--------+--------------------+--------------------+------+--------------------+--------------------+--------+-------+----------+-------+
|[{'llamaEvent': {...|[llamaEvent -> {"...|[activity -> {"An...|activity|{"Animal":[{"time...|[Animal -> [{"tim...|Animal|[{"time":"5-1-202...|[time -> 5-1-2020...|5-1-2020|Running| red llama| 1|
|[{'llamaEvent': {...|[llamaEvent -> {"...|[activity -> {"An...|activity|{"Animal":[{"time...|[Animal -> [{"tim...|Animal|[{"time":"5-1-202...|[time -> 6-2-2020...|6-2-2020|Sitting|blue llama| 1|
+--------------------+--------------------+--------------------+--------+--------------------+--------------------+------+--------------------+--------------------+--------+-------+----------+-------+
df_grp.show(truncate=False)
+-------+------------------+----------------+---------------------+
|agg_col|time |status |llamaType |
+-------+------------------+----------------+---------------------+
|1 |5-1-2020#;6-2-2020|Running#;Sitting|red llama#;blue llama|
+-------+------------------+----------------+---------------------+


Pyspark RDD column value selection

I have an RDD like this:
+-------+---------------------------------------------------------+
|item_id|recommendations                                          |
+-------+---------------------------------------------------------+
|1      |[{810, 5.2324243}, {134, 4.58323}, {810, 4.89248}]       |
|23     |[{1643, 5.1180077}, {1463, 4.8429747}, {1368, 4.4758873}]|
+-------+---------------------------------------------------------+
I want to extract only the first value of each {} in the "recommendations" column.
Expected result:
+-------+------------------+
|item_id|recommendations   |
+-------+------------------+
|1      |[810, 134, 810]   |
|23     |[1643, 1463, 1368]|
+-------+------------------+
What should I do? Thanks!
Not sure if your data is an RDD or a DataFrame, so I provide both here. Overall, from your sample data, I assume your recommendations is an array of struct type. You will know the exact columns by running df.printSchema() (if it is a DataFrame) or rdd.first() (if it is an RDD). I created a dummy schema with two columns a and b.
This is my "dummy" class
class X():
    def __init__(self, a, b):
        self.a = a
        self.b = b
This is my "dummy" data
schema = T.StructType([
    T.StructField('id', T.IntegerType()),
    T.StructField('rec', T.ArrayType(T.StructType([
        T.StructField('a', T.IntegerType()),
        T.StructField('b', T.FloatType()),
    ])))
])
df = spark.createDataFrame([
    (1, [X(810, 5.2324243), X(134, 4.58323), X(810, 4.89248)]),
    (23, [X(1643, 5.1180077), X(1463, 4.8429747), X(1368, 4.4758873)])
], schema)
If your data is a dataframe
df.show(10, False)
df.printSchema()
+---+---------------------------------------------------------+
|id |rec |
+---+---------------------------------------------------------+
|1 |[{810, 5.2324243}, {134, 4.58323}, {810, 4.89248}] |
|23 |[{1643, 5.1180077}, {1463, 4.8429747}, {1368, 4.4758873}]|
+---+---------------------------------------------------------+
root
|-- id: integer (nullable = true)
|-- rec: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = true)
| | |-- b: float (nullable = true)
(df
 .select('id', F.explode('rec').alias('rec'))
 .groupBy('id')
 .agg(F.collect_list('rec.a').alias('rec'))
 .show()
)
+---+------------------+
| id| rec|
+---+------------------+
| 1| [810, 134, 810]|
| 23|[1643, 1463, 1368]|
+---+------------------+
If your data is an rdd
dfrdd = df.rdd
dfrdd.first()
# Row(id=1, rec=[Row(a=810, b=5.232424259185791), Row(a=134, b=4.583230018615723), Row(a=810, b=4.89247989654541)])
(dfrdd
 .map(lambda x: (x.id, [r.a for r in x.rec]))
 .toDF()
 .show()
)
+---+------------------+
| _1| _2|
+---+------------------+
| 1| [810, 134, 810]|
| 23|[1643, 1463, 1368]|
+---+------------------+

Read csv into Dataframe with nested column

I have a csv file like this:
weight,animal_type,animal_interpretation
20,dog,"{is_large_animal=true, is_mammal=true}"
3.5,cat,"{is_large_animal=false, is_mammal=true}"
6.00E-04,ant,"{is_large_animal=false, is_mammal=false}"
And I created a case class schema with the following:
package types

case class AnimalsType (
  weight: Option[Double],
  animal_type: Option[String],
  animal_interpretation: Option[AnimalInterpretation]
)

case class AnimalInterpretation (
  is_large_animal: Option[Boolean],
  is_mammal: Option[Boolean]
)
I tried to load the csv into a dataframe with:
var df = spark.read.format("csv").option("header", "true").load("src/main/resources/animals.csv").as[AnimalsType]
But got the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Can't extract value from animal_interpretation#12: need struct type but got string;
Am I doing something wrong? What would be the proper way of doing this?
You cannot assign a schema to the CSV string directly. You need to transform the CSV string column (animal_interpretation) into JSON format first, as I have done in the code below using a UDF. If you can get the input data in a format like df1, then there is no need for the UDF below; you can continue from df1 and get the final dataframe df2.
There is no need for any case class, since your data header already contains the column names; for the JSON data you only need to declare the schema AnimalInterpretationSch, as below.
scala> import org.apache.spark.sql.types._
scala> import org.apache.spark.sql.expressions.UserDefinedFunction
//Input CSV DataFrame
scala> df.show(false)
+--------+-----------+---------------------------------------+
|weight |animal_type|animal_interpretation |
+--------+-----------+---------------------------------------+
|20 |dog |{is_large_animal=true, is_mammal=true} |
|3.5 |cat |{is_large_animal=false, is_mammal=true}|
|6.00E-04|ant |{is_large_animal=false,is_mammal=false}|
+--------+-----------+---------------------------------------+
//UDF to convert "animal_interpretation" column to Json Format
scala> def StringToJson:UserDefinedFunction = udf((data:String,JsonColumn:String) => {
| var out = data
| val JsonColList = JsonColumn.trim.split(",").toList
| JsonColList.foreach{ rr =>
| out = out.replaceAll(rr, "'"+rr+"'")
| }
| out = out.replaceAll("=", ":")
| out
| })
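The UDF above is simple string surgery: quote each known key, then turn "=" into ":". A plain-Python sketch of the same transformation (an illustration of the logic, not the Scala UDF itself):

```python
# Mirror of the StringToJson UDF's logic: wrap each listed key in quotes,
# then replace "=" with ":" so the string becomes JSON-parseable.
def string_to_json(data: str, json_cols: str) -> str:
    out = data
    for col in json_cols.strip().split(","):
        out = out.replace(col, "'" + col + "'")
    return out.replace("=", ":")

print(string_to_json("{is_large_animal=true, is_mammal=true}", "is_large_animal,is_mammal"))
# {'is_large_animal':true, 'is_mammal':true}
```

Note this is fragile in the same way as the UDF: it assumes no key is a substring of another key or of a value.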
//All column from Json
scala> val JsonCol = "is_large_animal,is_mammal"
//New dataframe with Json format
scala> val df1 = df.withColumn("animal_interpretation", StringToJson(col("animal_interpretation"), lit(JsonCol)))
scala> df1.show(false)
+--------+-----------+-------------------------------------------+
|weight |animal_type|animal_interpretation |
+--------+-----------+-------------------------------------------+
|20 |dog |{'is_large_animal':true, 'is_mammal':true} |
|3.5 |cat |{'is_large_animal':false, 'is_mammal':true}|
|6.00E-04|ant |{'is_large_animal':false,'is_mammal':false}|
+--------+-----------+-------------------------------------------+
//Schema declaration of Json format
scala> val AnimalInterpretationSch = new StructType().add("is_large_animal", BooleanType).add("is_mammal", BooleanType)
//Accessing Json columns
scala> val df2 = df1.select(col("weight"), col("animal_type"),from_json(col("animal_interpretation"), AnimalInterpretationSch).as("jsondata")).select("weight", "animal_type", "jsondata.*")
scala> df2.printSchema
root
|-- weight: string (nullable = true)
|-- animal_type: string (nullable = true)
|-- is_large_animal: boolean (nullable = true)
|-- is_mammal: boolean (nullable = true)
scala> df2.show()
+--------+-----------+---------------+---------+
| weight|animal_type|is_large_animal|is_mammal|
+--------+-----------+---------------+---------+
| 20| dog| true| true|
| 3.5| cat| false| true|
|6.00E-04| ant| false| false|
+--------+-----------+---------------+---------+

How to parse JSON records in Structured Streaming?

I'm working on a spark structured streaming app and I'm trying to parse JSON given in below format.
{"name":"xyz","age":29,"details":["city":"mumbai","country":"India"]}
{"name":"abc","age":25,"details":["city":"mumbai","country":"India"]}
Below is my Spark code to parse the JSON:
import org.apache.spark.sql.types._
import spark.implicits._
val schema = new StructType()
  .add("name", DataTypes.StringType)
  .add("age", DataTypes.IntegerType)
  .add("details",
    new StructType()
      .add("city", DataTypes.StringType)
      .add("country", DataTypes.StringType)
  )
val dfLogLines = dfRawData.selectExpr("CAST(value AS STRING)") //Converting binary to text
val personNestedDf = dfLogLines.select(from_json($"value", schema).as("person"))
val personFlattenedDf = personNestedDf.selectExpr("person.name", "person.age")
personFlattenedDf.printSchema()
personFlattenedDf.writeStream.format("console").option("checkpointLocation",checkpoint_loc3).start().awaitTermination()
Output:
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
-------------------------------------------
Batch: 0
-------------------------------------------
+----+----+
|name| age|
+----+----+
|null|null|
|null|null|
+----+----+
The code does not throw any error but it returns null values in output. What am I doing wrong here?
Thanks in advance.
tl;dr The JSON looks not well-formed in the details field.
From the documentation of from_json standard function:
Returns null, in the case of an unparseable string.
The issue is with the details field.
{"details":["city":"mumbai","country":"India"]}
It looks like an array or a map, but it matches neither.
scala> Seq(Array("one", "two")).toDF("value").toJSON.show(truncate = false)
+-----------------------+
|value |
+-----------------------+
|{"value":["one","two"]}|
+-----------------------+
scala> Seq(Map("one" -> "two")).toDF("value").toJSON.show(truncate = false)
+-----------------------+
|value |
+-----------------------+
|{"value":{"one":"two"}}|
+-----------------------+
scala> Seq(("mumbai", "India")).toDF("city", "country").select(struct("city", "country") as "details").toJSON.show(truncate = false)
+-----------------------------------------------+
|value |
+-----------------------------------------------+
|{"details":{"city":"mumbai","country":"India"}}|
+-----------------------------------------------+
My recommendation would be to do the JSON parsing yourself using a user-defined function (UDF).
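One way such a hand-rolled parser could work (an assumption on my part, not part of the original answer): repair the malformed details brackets into an object before handing the string to a standard JSON parser. A plain-Python sketch with a hypothetical helper `repair_details`:

```python
import json

# The details field uses [...] around key:value pairs, which is neither a
# valid JSON array nor a valid object; rewriting the brackets to braces makes
# it a plain nested object that any JSON parser accepts. This naive string
# replacement assumes "details" is the last field and appears once per line.
def repair_details(line: str) -> dict:
    fixed = line.replace('"details":[', '"details":{').replace(']}', '}}')
    return json.loads(fixed)

rec = repair_details('{"name":"xyz","age":29,"details":["city":"mumbai","country":"India"]}')
print(rec["details"]["city"])  # mumbai
```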

How to return ListBuffer as a column from UDF using Spark Scala?

I am trying to use UDFs and return a ListBuffer as a column from a UDF, but I am getting an error.
I have created the DataFrame by executing the code below:
val df = Seq((1,"dept3##rama##kumar","dept3##rama##kumar"), (2,"dept31##rama1##kumar1","dept33##rama3##kumar3")).toDF("id","str1","str2")
df.show()
it show like below:
+---+--------------------+--------------------+
| id| str1| str2|
+---+--------------------+--------------------+
| 1| dept3##rama##kumar| dept3##rama##kumar|
| 2|dept31##rama1##ku...|dept33##rama3##ku...|
+---+--------------------+--------------------+
As per my requirement, I have to split the above columns based on some inputs, so I have tried a UDF like below:
def appendDelimiterError = udf((id: Int, str1: String, str2: String) => {
  var lit = new ListBuffer[Any]()
  if(str1.contains("##"){val a=str1.split("##")}
  else if(str1.contains("##"){val a=str1.split("##")}
  else if(str1.contains("#&"){val a=str1.split("#&")}
  if(str2.contains("##"){ val b=str2.split("##")}
  else if(str2.contains("##"){ val b=str2.split("##") }
  else if(str1.contains("##"){val b=str2.split("##")}
  var tmp_row = List(a, "test1", b)
  lit += tmp_row
  return lit
})
I tried to call it by executing the code below:
val df1=df.appendDelimiterError("newcol",appendDelimiterError(df("id"),df("str1"),df("str2"))
I am getting the error "this was a bad call". I want to use a ListBuffer/List to store the result and return it to the calling place.
my expected output will be:
+---+--------------------+------------------------+----------------------------------------------------------------------+
| id| str1| str2 | newcol |
+---+--------------------+------------------------+----------------------------------------------------------------------+
| 1| dept3##rama##kumar| dept3##rama##kumar |ListBuffer(List("dept","rama","kumar"),List("dept3","rama","kumar")) |
| 2|dept31##rama1##kumar1|dept33##rama3##kumar3 | ListBuffer(List("dept31","rama1","kumar1"),List("dept33","rama3","kumar3")) |
+---+--------------------+------------------------+----------------------------------------------------------------------+
How to achieve this?
An alternative with my own fictional data to which you can tailor and no UDF:
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

val df = Seq(
  (1, "111##cat##666", "222##fritz##777"),
  (2, "AAA##cat##555", "BBB##felix##888"),
  (3, "HHH##mouse##yyy", "123##mickey##ZZZ")
).toDF("c0", "c1", "c2")

val df2 = df.withColumn("c_split", split(col("c1"), "(##)|(##)|(##)|(##)"))
  .union(df.withColumn("c_split", split(col("c2"), "(##)|(##)|(##)|(##)")))
df2.show(false)
df2.printSchema()
val df3 = df2.groupBy(col("c0")).agg(collect_list(col("c_split")).as("List_of_Data") )
df3.show(false)
df3.printSchema()
This gives the answer, but with no ListBuffer - is that really necessary? Output as follows:
+---+---------------+----------------+------------------+
|c0 |c1 |c2 |c_split |
+---+---------------+----------------+------------------+
|1 |111##cat##666 |222##fritz##777 |[111, cat, 666] |
|2 |AAA##cat##555 |BBB##felix##888 |[AAA, cat, 555] |
|3 |HHH##mouse##yyy|123##mickey##ZZZ|[HHH, mouse, yyy] |
|1 |111##cat##666 |222##fritz##777 |[222, fritz, 777] |
|2 |AAA##cat##555 |BBB##felix##888 |[BBB, felix, 888] |
|3 |HHH##mouse##yyy|123##mickey##ZZZ|[123, mickey, ZZZ]|
+---+---------------+----------------+------------------+
root
|-- c0: integer (nullable = false)
|-- c1: string (nullable = true)
|-- c2: string (nullable = true)
|-- c_split: array (nullable = true)
| |-- element: string (containsNull = true)
+---+---------------------------------------+
|c0 |List_of_Data |
+---+---------------------------------------+
|1 |[[111, cat, 666], [222, fritz, 777]] |
|3 |[[HHH, mouse, yyy], [123, mickey, ZZZ]]|
|2 |[[AAA, cat, 555], [BBB, felix, 888]] |
+---+---------------------------------------+
root
|-- c0: integer (nullable = false)
|-- List_of_Data: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)

Sort by date an Array of a Spark DataFrame Column

I have a DataFrame formatted as below:
+---+------------------------------------------------------+
|Id |DateInfos |
+---+------------------------------------------------------+
|B |[[3, 19/06/2012-02.42.01], [4, 17/06/2012-18.22.21]] |
|A |[[1, 15/06/2012-18.22.16], [2, 15/06/2012-09.22.35]] |
|C |[[5, 14/06/2012-05.20.01]] |
+---+------------------------------------------------------+
I would like to sort each element of the DateInfos column by date, using the timestamp in the second element of my array:
+---+------------------------------------------------------+
|Id |DateInfos |
+---+------------------------------------------------------+
|B |[[4, 17/06/2012-18.22.21], [3, 19/06/2012-02.42.01]] |
|A |[[2, 15/06/2012-09.22.35], [1, 15/06/2012-18.22.16]] |
|C |[[5, 14/06/2012-05.20.01]] |
+---+------------------------------------------------------+
the schema of my DataFrame is printed as below:
root
|-- C1: string (nullable = true)
|-- C2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: string (nullable = false)
I assume I have to create a UDF which uses a function with the following signature:
def sort_by_date(mouvements : Array[Any]) : Array[Any]
Do you have any idea?
That's indeed a bit tricky - because although the UDF's input and output types seem identical, we can't really define it that way - because the input is actually a mutable.WrappedArray[Row] and the output can't use Row or else Spark will fail to decode it into a Row...
So we define a UDF that takes a mutable.WrappedArray[Row] and returns an Array[(Int, String)]:
import scala.collection.mutable
import org.apache.spark.sql.Row

val sortDates = udf { arr: mutable.WrappedArray[Row] =>
  arr.map { case Row(i: Int, s: String) => (i, s) }.sortBy(_._2)
}
val result = input.select($"Id", sortDates($"DateInfos") as "DateInfos")
result.show(truncate = false)
// +---+--------------------------------------------------+
// |Id |DateInfos |
// +---+--------------------------------------------------+
// |B |[[4,17/06/2012-18.22.21], [3,19/06/2012-02.42.01]]|
// |A |[[2,15/06/2012-09.22.35], [1,15/06/2012-18.22.16]]|
// |C |[[5,14/06/2012-05.20.01]] |
// +---+--------------------------------------------------+
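One caveat worth noting: sortBy(_._2) orders the "dd/MM/yyyy-HH.mm.ss" strings lexicographically, which happens to match chronological order in this sample but breaks across months and years (e.g. "05/07/2012" sorts before "17/06/2012" as a string). Parsing the timestamp first is safer; a sketch of the idea in plain Python (the Scala UDF would do the same with java.time):

```python
from datetime import datetime

# Parse "dd/MM/yyyy-HH.mm.ss" so the ordering is chronological rather than
# lexicographic on the raw string.
def sort_by_date(infos):
    return sorted(infos, key=lambda t: datetime.strptime(t[1], "%d/%m/%Y-%H.%M.%S"))

rows = [(3, "19/06/2012-02.42.01"), (4, "17/06/2012-18.22.21")]
print(sort_by_date(rows))  # [(4, '17/06/2012-18.22.21'), (3, '19/06/2012-02.42.01')]
```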