How to define a schema for json to be used in from_json to parse out values - scala

I am trying to come up with a schema definition to parse information out of a DataFrame string column, using from_json. I need help defining the schema, which I cannot seem to get right.
Here is the Json I have
[
{
"sectionid":"838096e332d4419191877a3fd40ed1f4",
"sequence":0,
"questions":[
{
"xid":"urn:com.mheducation.openlearning:lms.assessment.author:qastg.global:assessment_item:2a0f52fb93954f4590ac88d90888be7b",
"questionid":"d36e1d7eeeae459c8db75c7d2dfd6ac6",
"quizquestionid":"d36e1d7eeeae459c8db75c7d2dfd6ac6",
"qtype":"3",
"sequence":0,
"subsectionsequence":-1,
"type":"80",
"question":"<p>This is a simple, 1 question assessment for automation testing</p>",
"totalpoints":"5.0",
"scoring":"1",
"scoringrules":"{\"type\":\"perfect\",\"points\":5.0,\"pointsEach\":null,\"rules\":[]}",
"inputoption":"0",
"casesensitive":"0",
"suggestedscoring":"1",
"suggestedscoringrules":"{\"type\":\"perfect\",\"points\":5.0,\"pointsEach\":null,\"rules\":[]}",
"answers":[
"1"
],
"options":[
]
}
]
}
]
I want to parse this information out into the following columns:
sectionid, sequence, xid, question.sequence, question.question (the question text), answers
Here is what I have so far. For testing, I defined a schema like this:
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, ArrayType, StructType, StructField}
val schema = new StructType()
.add("sectionid", StringType, true)
.add("sequence", StringType, true)
.add("questions", StringType, true)
.add("answers", StringType, true)
finalDF = finalDF
.withColumn( "parsed", from_json(col("enriched_payload.transformed"),schema) )
But I am getting NULL in the result columns; I think the reason is that my schema is not right.
I am struggling to come up with the right definition. How do I define the correct schema for this JSON?
I am using Spark 3.0.

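Try the code below. The top-level JSON value is an array, so the schema has to be an ArrayType wrapping a StructType. questions is itself an array of structs, and answers belongs inside each question rather than at the top level.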
import org.apache.spark.sql.types._
val schema = ArrayType(
new StructType()
.add("sectionid",StringType,true)
.add("sequence",LongType,true)
.add("questions", ArrayType(
new StructType()
.add("answers",ArrayType(StringType,true),true)
.add("casesensitive",StringType,true)
.add("inputoption",StringType,true)
.add("options",ArrayType(StringType,true),true)
.add("qtype",StringType,true)
.add("question",StringType,true)
.add("questionid",StringType,true)
.add("quizquestionid",StringType,true)
.add("scoring",StringType,true)
.add("scoringrules",StringType,true)
.add("sequence",LongType,true)
.add("subsectionsequence",LongType,true)
.add("suggestedscoring",StringType,true)
.add("suggestedscoringrules",StringType,true)
.add("totalpoints",StringType,true)
.add("type",StringType,true)
.add("xid",StringType,true)
)
)
)
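To get from this schema to the flat columns the question asks for, one option is to explode the parsed array twice: once for the sections and once for the questions inside each section. The following is a minimal sketch rather than a definitive implementation. It trims the schema to the fields of interest, uses a shortened sample payload, and the column name payload and the local SparkSession are assumptions made for illustration (in the question the JSON lives in enriched_payload.transformed).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._

// Assumption: a local session for illustration; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("from-json-sketch").getOrCreate()
import spark.implicits._

// Trimmed version of the schema above: only the fields that end up as columns.
val trimmedSchema = ArrayType(
  new StructType()
    .add("sectionid", StringType)
    .add("sequence", LongType)
    .add("questions", ArrayType(
      new StructType()
        .add("xid", StringType)
        .add("sequence", LongType)
        .add("question", StringType)
        .add("answers", ArrayType(StringType)))))

// Shortened, hypothetical payload with the same shape as the JSON in the question.
val json =
  """[{"sectionid":"s1","sequence":0,"questions":[{"xid":"x1","sequence":0,"question":"<p>text</p>","answers":["1"]}]}]"""
val df = Seq(json).toDF("payload")

val flat = df
  // One row per section in the top-level array.
  .withColumn("section", explode(from_json(col("payload"), trimmedSchema)))
  // One row per question within each section.
  .withColumn("q", explode(col("section.questions")))
  .select(
    col("section.sectionid").as("sectionid"),
    col("section.sequence").as("sequence"),
    col("q.xid").as("xid"),
    col("q.sequence").as("question_sequence"),
    col("q.question").as("question"),
    col("q.answers").as("answers"))

flat.show(truncate = false)
```

Note that explode drops rows whose array is null or empty; explode_outer keeps them with null values if that matters for your data.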

Dataframe Schema prints out empty

I'm experimenting with creating DataFrames in Spark with Scala.
But the log output shows an empty schema and an empty DataFrame.
Can someone tell me what the issue is here? Thanks!
val simpleData = Seq(Row("James","","Smith","36636"),
Row("Michael","Rose","","40288"),
Row("Robert","","Williams","42114"),
Row("Maria","Anne","Jones","39192"),
Row("Jen","Mary","Brown","")
)
val simpleSchema = StructType(Array(
StructField("firstname",StringType,true),
StructField("middlename",StringType,true),
StructField("lastname",StringType,true),
StructField("id", StringType, true)
))
val df = spark.createDataFrame(
sc.parallelize(simpleData),simpleSchema)
logger.info(s"df printschema: ${df.printSchema()}")
logger.info(s"df show: ${df.show}")
df.printSchema() prints the schema to standard output and returns Unit, so there is no value for your logger to print; the same is true of df.show. You can access the schema as a StructType using df.schema.
Try changing
logger.info(s"df printschema: ${df.printSchema()}")
to
logger.info(s"df printschema: ${df.schema.simpleString}")
or
logger.info(s"df printschema: ${df.schema.json}")
for the json representation.
Let me know if this works for you.
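If the tree-style schema text and the rendered table are both wanted in the log, one possible approach is sketched below. It relies on df.schema.treeString for the schema and redirects Console output while show runs to capture the table. The two-column sample data, the local SparkSession, and the use of println in place of the question's logger are assumptions made for illustration.

```scala
import java.io.ByteArrayOutputStream
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumption: a local session for illustration; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("schema-logging-sketch").getOrCreate()

// Cut-down version of the question's data, for brevity.
val simpleSchema = StructType(Seq(
  StructField("firstname", StringType, true),
  StructField("id", StringType, true)))
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("James", "36636"), Row("Jen", ""))),
  simpleSchema)

// treeString returns the text that printSchema() would write to stdout.
val schemaText: String = df.schema.treeString

// show() writes through Console, so its output can be redirected into a buffer.
val buffer = new ByteArrayOutputStream()
Console.withOut(buffer) { df.show() }
val tableText: String = buffer.toString("UTF-8")

// println stands in for logger.info here.
println(s"df schema:\n$schemaText")
println(s"df rows:\n$tableText")
```

Capturing show output this way collects rows on the driver just as show itself does, so it is best kept to small previews.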

Define StructType as input datatype of a Function Spark-Scala 2.11 [duplicate]

I'm trying to write a Spark UDF in Scala, and I need to define the function's input data type.
I have a schema variable of type StructType, shown below.
import org.apache.spark.sql.types._
val relationsSchema = StructType(
Seq(
StructField("relation", ArrayType(
StructType(Seq(
StructField("attribute", StringType, true),
StructField("email", StringType, true),
StructField("fname", StringType, true),
StructField("lname", StringType, true)
)
), true
), true)
)
)
I'm trying to write a function like the one below:
val relationsFunc: Array[Map[String,String]] => Array[String] = _.map(do something)
val relationUDF = udf(relationsFunc)
input.withColumn("relation",relationUDF(col("relation")))
The code above throws the following exception:
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(relation)' due to data type mismatch: argument 1 requires array<map<string,string>> type, however, '`relation`' is of array<struct<attribute:string,email:string,fname:string,lname:string>> type.;;
'Project [relation#89, UDF(relation#89) AS proc#273]
If I declare the input type as
val relationsFunc: StructType => Array[String] =
I'm not able to implement the logic, because _.map gives me metadata, field names, etc.
Please advise how to define relationsSchema as the input data type in the function below.
val relationsFunc: ? => Array[String] = _.map(somelogic)
The relation column is an array of structs. Spark passes an array column to a Scala UDF as a Seq and each struct as a Row, so your function should have the following signature:
val relationsFunc: Seq[Row] => Seq[String]
Declaring the parameter as Array[Row] compiles but fails at runtime with a ClassCastException.
You can then access your data either by position or by name, e.g.:
{r:Row => r.getAs[String]("email")}
Check the mapping table in the documentation to determine the data type representations between Spark SQL and Scala: https://spark.apache.org/docs/2.4.4/sql-reference.html#data-types
Your relation field is a Spark SQL ArrayType whose elements are of type StructType. These are represented by the Scala types scala.collection.Seq and org.apache.spark.sql.Row respectively, so Seq[Row] is the input type you should be using.
I used your code to create this complete example that extracts the email values:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types._
val relationsSchema = StructType(
  Seq(
    StructField("relation", ArrayType(
      StructType(
        Seq(
          StructField("attribute", StringType, true),
          StructField("email", StringType, true),
          StructField("fname", StringType, true),
          StructField("lname", StringType, true)
        )
      ), true
    ), true)
  )
)
// Each outer Row holds one value for "relation": a Seq of Rows, one per struct.
val data = Seq(
  Row(Seq(Row("1", "johnny#example.com", "Johnny", "Appleseed")))
)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  relationsSchema
)
// The array column arrives in the UDF as Seq[Row], not Array[Row].
val relationsFunc = (relation: Seq[Row]) => relation.map(_.getAs[String]("email"))
val relationUdf = udf(relationsFunc)
df.withColumn("relation", relationUdf(col("relation")))

Converting RDD into Dataframe

I am new to Spark and Scala.
I have created the RDD below by loading data from multiple paths. Now I want to create a DataFrame from it for further operations.
The DataFrame should have the following schema:
schema[UserId, EntityId, WebSessionId, ProductId]
rdd.foreach(println)
545456,5615615,DIKFH6545614561456,PR5454564656445454
875643,5485254,JHDSFJD543514KJKJ4
545456,5615615,DIKFH6545614561456,PR5454564656445454
545456,5615615,DIKFH6545614561456,PR5454564656445454
545456,5615615,DIKFH6545614561456,PR54545DSKJD541054
264264,3254564,MNXZCBMNABC5645SAD,PR5142545564542515
732543,8765984,UJHSG4240323545144
564574,6276832,KJDXSGFJFS2545DSAS
Can anyone please help?
I tried defining a schema class and mapping it against the RDD, but I get the error
"ArrayIndexOutOfBoundsException: 3"
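If you treat all your columns as String, you can create the DataFrame with the following:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val rdd : RDD[Row] = ???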
val df = spark.createDataFrame(rdd, StructType(Seq(
StructField("userId", StringType, false),
StructField("EntityId", StringType, false),
StructField("WebSessionId", StringType, false),
StructField("ProductId", StringType, true))))
Note that you must map your RDD to an RDD[Row] before the compiler will let you call createDataFrame with a schema. For the missing fields, you can declare the columns as nullable in the DataFrame schema.
In your example you are using the method spark.sparkContext.textFile(). It returns an RDD[String], which means that each element of your RDD is a line. But you need an RDD[Row], so you have to split each line on commas and pad the lines that have no ProductId, like this:
val list =
List("545456,5615615,DIKFH6545614561456,PR5454564656445454",
"875643,5485254,JHDSFJD543514KJKJ4",
"545456,5615615,DIKFH6545614561456,PR5454564656445454",
"545456,5615615,DIKFH6545614561456,PR5454564656445454",
"545456,5615615,DIKFH6545614561456,PR54545DSKJD541054",
"264264,3254564,MNXZCBMNABC5645SAD,PR5142545564542515",
"732543,8765984,UJHSG4240323545144","564574,6276832,KJDXSGFJFS2545DSAS")
val FilterReadClicks = spark.sparkContext.parallelize(list)
val rows: RDD[Row] = FilterReadClicks.map(line => line.split(",")).map { arr =>
  // Keep the fields in their original order and pad short lines with ""
  // so that every Row has exactly four values.
  Row.fromSeq(arr.toSeq.padTo(4, ""))
}
rows.foreach(el => println(el.toSeq))
val df = spark.createDataFrame(rows, StructType(Seq(
  StructField("userId", StringType, false),
  StructField("EntityId", StringType, false),
  StructField("WebSessionId", StringType, false),
  StructField("ProductId", StringType, true))))
df.show()
+------+--------+------------------+------------------+
|userId|EntityId|      WebSessionId|         ProductId|
+------+--------+------------------+------------------+
|545456| 5615615|DIKFH6545614561456|PR5454564656445454|
|875643| 5485254|JHDSFJD543514KJKJ4|                  |
|545456| 5615615|DIKFH6545614561456|PR5454564656445454|
|545456| 5615615|DIKFH6545614561456|PR5454564656445454|
|545456| 5615615|DIKFH6545614561456|PR54545DSKJD541054|
|264264| 3254564|MNXZCBMNABC5645SAD|PR5142545564542515|
|732543| 8765984|UJHSG4240323545144|                  |
|564574| 6276832|KJDXSGFJFS2545DSAS|                  |
+------+--------+------------------+------------------+
With rows rdd you will be able to create the dataframe.

Dynamic schema generation from array of columns

I have a list of columns and used them to build a schema.
Code:
import org.apache.spark.sql.types._
val fields = Array("col1", "col2", "col3", "col4", "col5", "col6")
val dynSchema = StructType( fields.map( field =>
new StructField(field, StringType, true, null) ) )
The resulting schema is:
StructType(StructField(col1,StringType,true), StructField(col2,StringType,true),
StructField(col3,StringType,true), StructField(col4,StringType,true),
StructField(col5,StringType,true), StructField(col6,StringType,true))
But I get a NullPointerException when I try to read data from a JSON file using the above schema.
// reading the data
spark.read.schema(dynSchema).json("./file/path/*.json")
But it works if I pass an Array of StructField literals to StructType directly.
Please help me generate the schema dynamically.
Edit: If I create the schema with the above fields as follows, I am able to read the data from JSON.
StructType(Array(
StructField("col1",StringType,true), StructField("col2",StringType,true),
StructField("col3",StringType,true), StructField("col4",StringType,true),
StructField("col5",StringType,true), StructField("col6",StringType,true)))
Simply remove the null argument from the creation of the StructField as follows:
val dynSchema = StructType( fields.map( field =>
new StructField(field, StringType, true)))
The last argument is used to define metadata about the column. Its default value is not null but Metadata.empty. See the source code for more detail. In the source code, they assume that it cannot be null and call methods on it without any checks. This is why you get a NullPointerException.
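As a quick check of the point about the default, the sketch below builds the schema without the fourth argument and confirms that every field ends up with Metadata.empty. The shortened field list and the value name allHaveMetadata are chosen here for illustration.

```scala
import org.apache.spark.sql.types.{Metadata, StringType, StructField, StructType}

// Shortened, hypothetical column list.
val fields = Array("col1", "col2", "col3")

// Omitting the metadata argument lets StructField fall back to Metadata.empty.
val dynSchema = StructType(fields.map(name => StructField(name, StringType, nullable = true)))

// Every generated field carries a non-null, empty metadata object.
val allHaveMetadata = dynSchema.fields.forall(_.metadata == Metadata.empty)

println(dynSchema.treeString)
println(s"metadata populated: $allHaveMetadata")
```

If per-column metadata is ever needed, it is safer to build it with MetadataBuilder than to pass null.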

Spark error when using except on a dataframe with MapType

I am seeing the error "Cannot have map type columns in DataFrame which calls set operations" when using Spark MapType.
Below is the sample code I wrote to reproduce it. I understand this happens because MapType values are not hashable, but I have a use case where I need to do the following.
val schema1 = StructType(Seq(
StructField("a", MapType(StringType, StringType, true)),
StructField("b", StringType, true)
))
val df = spark.read.schema(schema1).json("path")
val filteredDF = df.filter($"b" === "apple")
val otherDF = df.except(filteredDF)
Any suggestions for workarounds?