How to parse JSON records in Structured Streaming? - scala

I'm working on a Spark Structured Streaming app and I'm trying to parse JSON given in the format below.
{"name":"xyz","age":29,"details":["city":"mumbai","country":"India"]}
{"name":"abc","age":25,"details":["city":"mumbai","country":"India"]}
Below is my Spark code to parse the JSON:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._
val schema= new StructType()
.add("name",DataTypes.StringType )
.add("age", DataTypes.IntegerType)
.add("details",
new StructType()
.add("city", DataTypes.StringType)
.add("country", DataTypes.StringType)
)
val dfLogLines = dfRawData.selectExpr("CAST(value AS STRING)") //Converting binary to text
val personNestedDf = dfLogLines.select(from_json($"value", schema).as("person"))
val personFlattenedDf = personNestedDf.selectExpr("person.name", "person.age")
personFlattenedDf.printSchema()
personFlattenedDf.writeStream.format("console").option("checkpointLocation",checkpoint_loc3).start().awaitTermination()
Output:
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
-------------------------------------------
Batch: 0
-------------------------------------------
+----+----+
|name| age|
+----+----+
|null|null|
|null|null|
+----+----+
The code does not throw any error but it returns null values in output. What am I doing wrong here?
Thanks in advance.

tl;dr The JSON is not well-formed in the details field.
From the documentation of from_json standard function:
Returns null, in the case of an unparseable string.
The issue is with the details field.
{"details":["city":"mumbai","country":"India"]}
It looks like an array or a map, but it matches neither.
scala> Seq(Array("one", "two")).toDF("value").toJSON.show(truncate = false)
+-----------------------+
|value |
+-----------------------+
|{"value":["one","two"]}|
+-----------------------+
scala> Seq(Map("one" -> "two")).toDF("value").toJSON.show(truncate = false)
+-----------------------+
|value |
+-----------------------+
|{"value":{"one":"two"}}|
+-----------------------+
scala> Seq(("mumbai", "India")).toDF("city", "country").select(struct("city", "country") as "details").toJSON.show(truncate = false)
+-----------------------------------------------+
|value |
+-----------------------------------------------+
|{"details":{"city":"mumbai","country":"India"}}|
+-----------------------------------------------+
My recommendation would be to do the JSON parsing yourself using a user-defined function (UDF).
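As a rough illustration of that recommendation, here is a minimal sketch of a repair function in plain Scala. It assumes the details field always holds flat key-value pairs wrapped in square brackets, as in the question's sample, and it would also rewrite a genuine JSON array under that key. The names DetailsRepair and repairDetails are hypothetical, not part of Spark.

```scala
import scala.util.matching.Regex

object DetailsRepair {
  // Matches "details":[ ... ] where the bracketed part has no nested brackets.
  private val MalformedDetails: Regex = """"details"\s*:\s*\[([^\[\]]*)\]""".r

  // Rewrites the square brackets around the details pairs into curly braces,
  // turning the malformed fragment into a regular JSON object.
  def repairDetails(line: String): String =
    MalformedDetails.replaceAllIn(
      line,
      m => Regex.quoteReplacement("\"details\":{" + m.group(1) + "}")
    )
}
```

Wrapped with udf from org.apache.spark.sql.functions and applied to the value column before from_json, a function like this should let the schema from the question match the repaired records.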

Related

Extract Array of Struct from Parquet into multi-value csv Spark Scala

Using Spark Scala I am trying to extract an array of Struct from parquet. The input is a parquet file. The output is a csv file. The field of the csv can have "multi-values" delimited by "#;". The csv is delimited by ",". What is the best way to accomplish this?
Schema
root
|-- llamaEvent: struct (nullable = true)
| |-- activity: struct (nullable = true)
| | |-- Animal: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- time: string (nullable = true)
| | | | |-- status: string (nullable = true)
| | | | |-- llamaType: string (nullable = true)
Example Input as json (the input will be parquet)
{
"llamaEvent":{
"activity":{
"Animal":[
{
"time":"5-1-2020",
"status":"Running",
"llamaType":"red llama"
},
{
"time":"6-2-2020",
"status":"Sitting",
"llamaType":"blue llama"
}
]
}
}
}
Desired CSV Output
time,status,llamaType
5-1-2020#;6-2-2020,running#;sitting,red llama#;blue llama
Update:
Based on some trial and error, I believe a solution like this may be appropriate, depending on the use case. It takes a shortcut: grab the array item, cast it to a string, then strip the extraneous characters.
df.select(col("llamaEvent.activity").getItem("Animal").getItem("time").cast("String").as("time"))
Afterwards you can perform whatever parsing you want, for example with regexp_replace:
df.withColumn("time", regexp_replace(col("time"),",",";#"))
Several other suitable solutions using groupBy, explode, and aggregation were also proposed.
One approach would be to flatten the array of animal attribute structs using SQL function inline and aggregate the attributes via collect_list, followed by concatenating with the specific delimiter.
Given a DataFrame df similar to your provided schema, the following transformations will generate the wanted dataset, dfResult:
import org.apache.spark.sql.functions.{collect_list, concat_ws, expr}
val attribCSVs = List("time", "status", "llamaType").map(
c => concat_ws("#;", collect_list(c)).as(c)
)
val dfResult = df.
select($"eventId", expr("inline(llamaEvent.activity.Animal)")).
groupBy("eventId").agg(attribCSVs.head, attribCSVs.tail: _*)
Note that an event identifying column eventId is added to the sample json data for the necessary groupBy aggregation.
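To make the per-attribute aggregation easier to picture, here is a small sketch of the same idea on plain Scala collections, without Spark. The Animal case class and the AnimalCsv helper are hypothetical names used only for illustration, and the sketch does not model nulls, which collect_list and concat_ws skip.

```scala
// One struct from the Animal array; field names follow the question's schema.
final case class Animal(time: String, status: String, llamaType: String)

object AnimalCsv {
  val Delimiter = "#;"

  // For one event, join each attribute across all animals with the delimiter,
  // producing one multi-value CSV field per attribute.
  def toCsvRow(animals: Seq[Animal]): Seq[String] = Seq(
    animals.map(_.time).mkString(Delimiter),
    animals.map(_.status).mkString(Delimiter),
    animals.map(_.llamaType).mkString(Delimiter)
  )
}
```

In the DataFrame version, inline plays the role of iterating over the animals, and concat_ws over collect_list plays the role of map followed by mkString.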
Let's assemble some sample data:
val jsons = Seq(
"""{
"eventId": 1,
"llamaEvent":{
"activity":{
"Animal":[
{
"time":"5-1-2020",
"status":"Running",
"llamaType":"red llama"
},
{
"time":"6-2-2020",
"status":"Sitting",
"llamaType":"blue llama"
}
]
}
}
}""",
"""{
"eventId": 2,
"llamaEvent":{
"activity":{
"Animal":[
{
"time":"5-2-2020",
"status":"Running",
"llamaType":"red llama"
},
{
"time":"6-3-2020",
"status":"Standing",
"llamaType":"blue llama"
}
]
}
}
}"""
)
val df = spark.read.option("multiLine", true).json(jsons.toDS)
df.show(false)
+-------+----------------------------------------------------------------------+
|eventId|llamaEvent |
+-------+----------------------------------------------------------------------+
|1 |{{[{red llama, Running, 5-1-2020}, {blue llama, Sitting, 6-2-2020}]}} |
|2 |{{[{red llama, Running, 5-2-2020}, {blue llama, Standing, 6-3-2020}]}}|
+-------+----------------------------------------------------------------------+
Applying the above transformations, dfResult should look like below:
dfResult.show(false)
+-------+------------------+-----------------+---------------------+
|eventId|time |status |llamaType |
+-------+------------------+-----------------+---------------------+
|1 |5-1-2020#;6-2-2020|Running#;Sitting |red llama#;blue llama|
|2 |5-2-2020#;6-3-2020|Running#;Standing|red llama#;blue llama|
+-------+------------------+-----------------+---------------------+
Writing dfResult to a CSV file:
dfResult.write.option("header", true).csv("/path/to/csv")
/*
eventId,time,status,llamaType
1,5-1-2020#;6-2-2020,Running#;Sitting,red llama#;blue llama
2,5-2-2020#;6-3-2020,Running#;Standing,red llama#;blue llama
*/
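This should be a working solution for you (PySpark; a_json is assumed to hold the sample input from the question):
from pyspark.sql import functions as F, types as T
df = spark.createDataFrame([(str([a_json]))],T.StringType())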
df = df.withColumn('col', F.from_json("value", T.ArrayType(T.StringType())))
df = df.withColumn("col", F.explode("col"))
df = df.withColumn("col", F.from_json("col", T.MapType(T.StringType(), T.StringType())))
df = df.withColumn("llamaEvent", df.col.getItem("llamaEvent"))
df = df.withColumn("llamaEvent", F.from_json("llamaEvent", T.MapType(T.StringType(), T.StringType())))
df = df.select("*", F.explode("llamaEvent").alias("x","y"))
df = df.withColumn("Activity", F.from_json("y", T.MapType(T.StringType(), T.StringType())))
df = df.select("*", F.explode("Activity").alias("x","yy"))
df = df.withColumn("final_col", F.from_json("yy", T.ArrayType(T.StringType())))
df = df.withColumn("final_col", F.explode("final_col"))
df = df.withColumn("final_col", F.from_json("final_col", T.MapType(T.StringType(), T.StringType())))
df = df.withColumn("time", df.final_col.getItem("time")).withColumn("status", df.final_col.getItem("status")).withColumn("llamaType", df.final_col.getItem("llamaType")).withColumn("agg_col", F.lit("1"))
df_grp = df.groupby("agg_col").agg(F.concat_ws("#;", F.collect_list(df.time)).alias("time"), F.concat_ws("#;", F.collect_list(df.status)).alias("status"), F.concat_ws("#;", F.collect_list(df.llamaType)).alias("llamaType"))
display(df)
+--------------------+--------------------+--------------------+--------+--------------------+--------------------+------+--------------------+--------------------+--------+-------+----------+-------+
| value| col| llamaEvent| x| y| Activity| x| yy| final_col| time| status| llamaType|agg_col|
+--------------------+--------------------+--------------------+--------+--------------------+--------------------+------+--------------------+--------------------+--------+-------+----------+-------+
|[{'llamaEvent': {...|[llamaEvent -> {"...|[activity -> {"An...|activity|{"Animal":[{"time...|[Animal -> [{"tim...|Animal|[{"time":"5-1-202...|[time -> 5-1-2020...|5-1-2020|Running| red llama| 1|
|[{'llamaEvent': {...|[llamaEvent -> {"...|[activity -> {"An...|activity|{"Animal":[{"time...|[Animal -> [{"tim...|Animal|[{"time":"5-1-202...|[time -> 6-2-2020...|6-2-2020|Sitting|blue llama| 1|
+--------------------+--------------------+--------------------+--------+--------------------+--------------------+------+--------------------+--------------------+--------+-------+----------+-------+
df_grp.show(truncate=False)
+-------+------------------+----------------+---------------------+
|agg_col|time |status |llamaType |
+-------+------------------+----------------+---------------------+
|1 |5-1-2020#;6-2-2020|Running#;Sitting|red llama#;blue llama|
+-------+------------------+----------------+---------------------+

Read csv into Dataframe with nested column

I have a csv file like this:
weight,animal_type,animal_interpretation
20,dog,"{is_large_animal=true, is_mammal=true}"
3.5,cat,"{is_large_animal=false, is_mammal=true}"
6.00E-04,ant,"{is_large_animal=false, is_mammal=false}"
And I created case class schema with the following:
package types
case class AnimalsType (
weight: Option[Double],
animal_type: Option[String],
animal_interpretation: Option[AnimalInterpretation]
)
case class AnimalInterpretation (
is_large_animal: Option[Boolean],
is_mammal: Option[Boolean]
)
I tried to load the csv into a dataframe with:
var df = spark.read.format("csv").option("header", "true").load("src/main/resources/animals.csv").as[AnimalsType]
But got the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Can't extract value from animal_interpretation#12: need struct type but got string;
Am I doing something wrong? What would be the proper way of doing this?
You cannot assign a nested schema to the CSV directly. You first need to transform the CSV string column (animal_interpretation) into JSON format, as I have done in the code below using a UDF. If you can get the input data in a format like df1, there is no need for the UDF: you can continue from df1 and get the final DataFrame df2.
There is no need for any case class, since your data header already contains the column names. For the JSON data you need to declare a schema, AnimalInterpretationSch, as below.
scala> import org.apache.spark.sql.types._
scala> import org.apache.spark.sql.expressions.UserDefinedFunction
//Input CSV DataFrame
scala> df.show(false)
+--------+-----------+---------------------------------------+
|weight |animal_type|animal_interpretation |
+--------+-----------+---------------------------------------+
|20 |dog |{is_large_animal=true, is_mammal=true} |
|3.5 |cat |{is_large_animal=false, is_mammal=true}|
|6.00E-04|ant |{is_large_animal=false,is_mammal=false}|
+--------+-----------+---------------------------------------+
//UDF to convert "animal_interpretation" column to Json Format
scala> def StringToJson:UserDefinedFunction = udf((data:String,JsonColumn:String) => {
| var out = data
| val JsonColList = JsonColumn.trim.split(",").toList
| JsonColList.foreach{ rr =>
| out = out.replaceAll(rr, "'"+rr+"'")
| }
| out = out.replaceAll("=", ":")
| out
| })
//All column from Json
scala> val JsonCol = "is_large_animal,is_mammal"
//New dataframe with Json format
scala> val df1 = df.withColumn("animal_interpretation", StringToJson(col("animal_interpretation"), lit(JsonCol)))
scala> df1.show(false)
+--------+-----------+-------------------------------------------+
|weight |animal_type|animal_interpretation |
+--------+-----------+-------------------------------------------+
|20 |dog |{'is_large_animal':true, 'is_mammal':true} |
|3.5 |cat |{'is_large_animal':false, 'is_mammal':true}|
|6.00E-04|ant |{'is_large_animal':false,'is_mammal':false}|
+--------+-----------+-------------------------------------------+
//Schema declaration of Json format
scala> val AnimalInterpretationSch = new StructType().add("is_large_animal", BooleanType).add("is_mammal", BooleanType)
//Accessing Json columns
scala> val df2 = df1.select(col("weight"), col("animal_type"),from_json(col("animal_interpretation"), AnimalInterpretationSch).as("jsondata")).select("weight", "animal_type", "jsondata.*")
scala> df2.printSchema
root
|-- weight: string (nullable = true)
|-- animal_type: string (nullable = true)
|-- is_large_animal: boolean (nullable = true)
|-- is_mammal: boolean (nullable = true)
scala> df2.show()
+--------+-----------+---------------+---------+
| weight|animal_type|is_large_animal|is_mammal|
+--------+-----------+---------------+---------+
| 20| dog| true| true|
| 3.5| cat| false| true|
|6.00E-04| ant| false| false|
+--------+-----------+---------------+---------+
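As a possible variation on the string conversion above, the sketch below quotes every key with a single regular expression instead of taking a list of column names. It is plain Scala and assumes that values never contain an '=' character; KeyValueToJson and toJson are hypothetical names.

```scala
object KeyValueToJson {
  // A key is a run of word characters directly followed by '='.
  private val KeyEquals = """(\w+)\s*=""".r

  // Turns {k1=v1, k2=v2} into {"k1":v1, "k2":v2} by quoting each key
  // and replacing '=' with ':'. Values are left untouched.
  def toJson(raw: String): String =
    KeyEquals.replaceAllIn(raw, m => "\"" + m.group(1) + "\":")
}
```

The result uses double-quoted keys, so it should parse with from_json and the same AnimalInterpretationSch schema once the function is wrapped in a UDF.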

Converting Array of Strings to String with different delimiters in Spark Scala

I want to convert an array of String in a dataframe to a single String that uses a delimiter other than a comma and has the array brackets removed. I want the "," to be replaced with ";#". This is to avoid problems with elements that may contain "," themselves, as it is a freeform text field. I am using Spark 1.6.
Examples below:
Schema:
root
|-- carLineName: array (nullable = true)
| |-- element: string (containsNull = true)
Input as Dataframe:
+--------------------+
|carLineName |
+--------------------+
|[Avalon,CRV,Camry] |
|[Model T, Model S] |
|[Cayenne, Mustang] |
|[Pilot, Jeep] |
Desired output:
+--------------------+
|carLineName |
+--------------------+
|Avalon;#CRV;#Camry |
|Model T;#Model S |
|Cayenne;#Mustang |
|Pilot;# Jeep |
Current code which produces the input above:
val newCarDf = carDf.select(col("carLineName").cast("String").as("carLineName"))
You can use the native function array_join (available since Spark 2.4):
import org.apache.spark.sql.functions.{array_join}
val l = Seq(Seq("Avalon","CRV","Camry"), Seq("Model T", "Model S"), Seq("Cayenne", "Mustang"), Seq("Pilot", "Jeep"))
val df = l.toDF("carLineName")
df.withColumn("str", array_join($"carLineName", ";#")).show()
+--------------------+------------------+
| carLineName| str|
+--------------------+------------------+
|[Avalon, CRV, Camry]|Avalon;#CRV;#Camry|
| [Model T, Model S]| Model T;#Model S|
| [Cayenne, Mustang]| Cayenne;#Mustang|
| [Pilot, Jeep]| Pilot;#Jeep|
+--------------------+------------------+
You can create a user-defined function that concatenates the elements with a "#;" separator, as in the following example:
val df1 = Seq(
("1", Array("t1", "t2")),
("2", Array("t1", "t3", "t5"))
).toDF("id", "arr")
import org.apache.spark.sql.functions.{col, udf}
def formatString: Seq[String] => String = x => x.reduce(_ ++ "#;" ++ _)
def udfFormat = udf(formatString)
df1.withColumn("formated", udfFormat(col("arr"))).show()
+---+------------+----------+
| id| arr| formated|
+---+------------+----------+
| 1| [t1, t2]| t1#;t2|
| 2|[t1, t3, t5]|t1#;t3#;t5|
+---+------------+----------+
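One caveat with a reduce-based join: reduce throws on an empty collection, so a row holding an empty array would fail the job. Below is a minimal plain-Scala sketch of a more defensive join, under the assumption that an empty array should become an empty string and a null array should stay null. JoinArray and join are hypothetical names.

```scala
object JoinArray {
  // Joins the elements with the delimiter. mkString returns "" for an empty
  // sequence, and Option(...) passes a null input through as null.
  def join(values: Seq[String], delimiter: String = "#;"): String =
    Option(values).map(_.mkString(delimiter)).orNull
}
```

A function like this can be wrapped with udf in the same way as formatString above.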
You could simply write a user-defined function (UDF) that takes an array of String as its input parameter. Inside the UDF, any operation can be performed on the array.
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
def toCustomString: UserDefinedFunction = udf((carLineName: Seq[String]) => {
carLineName.mkString(";#")
})
val newCarDf = df.withColumn("carLineName", toCustomString(df.col("carLineName")))
This UDF could be made more generic by passing the delimiter as a second parameter.
import org.apache.spark.sql.functions.lit
def toCustomStringWithDelimiter: UserDefinedFunction = udf((carLineName: Seq[String], delimiter: String) => {
carLineName.mkString(delimiter)
})
val newCarDf = df.withColumn("carLineName", toCustomStringWithDelimiter(df.col("carLineName"), lit(";#")))
Since you are using 1.6, we can do simple map of Row to WrappedArray.
Here is how it goes.
Input :
scala> val carLineDf = Seq( (Array("Avalon","CRV","Camry")),
| (Array("Model T", "Model S")),
| (Array("Cayenne", "Mustang")),
| (Array("Pilot", "Jeep"))
| ).toDF("carLineName")
carLineDf: org.apache.spark.sql.DataFrame = [carLineName: array<string>]
Schema ::
scala> carLineDf.printSchema
root
|-- carLineName: array (nullable = true)
| |-- element: string (containsNull = true)
Then we just use Row.getAs to get a WrappedArray of String instead of a Row object, and we can manipulate it with the usual Scala built-ins:
scala> import scala.collection.mutable.WrappedArray
import scala.collection.mutable.WrappedArray
scala> carLineDf.map( row => row.getAs[WrappedArray[String]](0)).map( a => a.mkString(";#")).toDF("carLineNameAsString").show(false)
+-------------------+
|carLineNameAsString|
+-------------------+
|Avalon;#CRV;#Camry |
|Model T;#Model S |
|Cayenne;#Mustang |
|Pilot;#Jeep |
+-------------------+
// An even easier alternative
carLineDf.map( row => row.getAs[WrappedArray[String]](0)).map( r => r.reduce(_+";#"+_)).show(false)
That's it. You might have to go through dataframe.rdd; otherwise this should do.

create empty array-column of given schema in Spark

Because Parquet cannot persist empty arrays, I replaced empty arrays with null before writing a table. Now, as I read the table, I want to do the opposite:
I have a DataFrame with the following schema :
|-- id: long (nullable = false)
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
and the following content:
+---+-----------+
| id| arr|
+---+-----------+
| 1|[[1.0,2.0]]|
| 2| null|
+---+-----------+
I'd like to replace the null-array (id=2) with an empty array, i.e.
+---+-----------+
| id| arr|
+---+-----------+
| 1|[[1.0,2.0]]|
| 2| []|
+---+-----------+
I've tried:
val arrSchema = df.schema(1).dataType
df
.withColumn("arr",when($"arr".isNull,array().cast(arrSchema)).otherwise($"arr"))
.show()
which gives :
java.lang.ClassCastException: org.apache.spark.sql.types.NullType$
cannot be cast to org.apache.spark.sql.types.StructType
Edit : I don't want to "hardcode" any schema of my array column (at least not the schema of the struct) because this can vary from case to case. I can only use the schema information from df at runtime
I'm using Spark 2.1 by the way, therefore I cannot use typedLit
Spark 2.2+ with known external type
In general you can use typedLit to provide empty arrays.
import org.apache.spark.sql.functions.typedLit
typedLit(Seq.empty[(Double, Double)])
To use specific names for nested objects you can use case classes:
case class Item(x: Double, y: Double)
typedLit(Seq.empty[Item])
or rename by cast:
typedLit(Seq.empty[(Double, Double)])
.cast("array<struct<x: Double, y: Double>>")
Spark 2.1+ with schema only
With schema only you can try:
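import org.apache.spark.sql.functions.{from_json, lit}
import org.apache.spark.sql.types._
val schema = StructType(Seq(
StructField("arr", ArrayType(StructType(Seq(
StructField("x", DoubleType),
StructField("y", DoubleType)
))))
))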
def arrayOfSchema(schema: StructType) =
from_json(lit("""{"arr": []}"""), schema)("arr")
arrayOfSchema(schema).alias("arr")
where schema can be extracted from the existing DataFrame and wrapped with additional StructType:
StructType(Seq(
StructField("arr", df.schema("arr").dataType)
))
One way is to use a UDF:
val arrSchema = df.schema(1).dataType // ArrayType(StructType(StructField(x,DoubleType,true), StructField(y,DoubleType,true)),true)
val emptyArr = udf(() => Seq.empty[Any],arrSchema)
df
.withColumn("arr",when($"arr".isNull,emptyArr()).otherwise($"arr"))
.show()
+---+-----------+
| id| arr|
+---+-----------+
| 1|[[1.0,2.0]]|
| 2| []|
+---+-----------+
Another approach would be to use coalesce:
val df = Seq(
(Some(1), Some(Array((1.0, 2.0)))),
(Some(2), None)
).toDF("id", "arr")
df.withColumn("arr", coalesce($"arr", typedLit(Array.empty[(Double, Double)]))).
show
// +---+-----------+
// | id| arr|
// +---+-----------+
// | 1|[[1.0,2.0]]|
// | 2| []|
// +---+-----------+
UDF with case class could also be interesting:
case class Item(x: Double, y: Double)
val udf_emptyArr = udf(() => Seq[Item]())
df
.withColumn("arr",coalesce($"arr",udf_emptyArr()))
.show()

Strange behavior when casting array of structs in spark scala

I'm encountering a strange behavior using spark 2.1.1 and scala 2.11.8:
import spark.implicits._
val df = Seq(
(1,Seq(("a","b"))),
(2,Seq(("c","d")))
).toDF("id","data")
df.show(false)
df.printSchema()
+---+-------+
|id |data |
+---+-------+
|1 |[[a,b]]|
|2 |[[c,d]]|
+---+-------+
root
|-- id: integer (nullable = false)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: string (nullable = true)
| | |-- _2: string (nullable = true)
Now I want to rename my struct fields as suggested in https://stackoverflow.com/a/39781382/1138523
df
.select($"id",$"data".cast("array<struct<k:string,v:string>>"))
.show()
Which results in the correct schema, but the content of the dataframe is now:
+---+-------+
| id| data|
+---+-------+
| 1|[[c,d]]|
| 2|[[c,d]]|
+---+-------+
Both rows now show the same array. What am I doing wrong?
EDIT: In spark 2.1.2 (and also spark 2.3.0) I get the expected output. I also get the expected output if I cache the dataframe:
val df = Seq(
(1,Seq(("a","b"))),
(2,Seq(("c","d")))
).toDF("id","data")
.cache
