In BigQuery, I have a column called actions that is of type RECORD in REPEATED mode. In Spark, I have the schema defined as:
val action: StructType = (new StructType)
.add("id", StringType)
.add("name", StringType)
.add("last", StringType)
val actionsList = new ArrayType(action, true)
val finalStruct: StructType = (new StructType)
.add("record", StringType)
.add("d", StringType)
.add("actions", actionsList)
This is how my schema is defined; I then simply read the data in and write it to BigQuery.
val df = spark.read.schema(finalStruct).json(rdd)
df.createOrReplaceTempView("myData")
val finalDf = sqlContext.sql("SELECT record as my_rec, d as inc_date, actions from myData")
finalDf.write.mode("append").format("bigquery")...save()
However, when I attempt to write the DataFrame, I get the error:
BigQuery error was provided Schema does not match Table <table_name_here>.
Cannot add fields (field: actions.list)
What's the proper way to define this schema? My incoming data is in JSON format, like:
{
"recordName":"name_here",
"date": "2020-01-01",
"actions": [
{
"id":"1",
"name":"aaa",
"last":"bbb"
},
{
"id":"2",
"name":"qqq",
"last":"www"
}
]
}
It's a known issue when the connector is used with its default settings, where Parquet is used as the intermediate format (see this similar bug report).
Changing the intermediate format to ORC solves the issue:
spark.conf.set("spark.datasource.bigquery.intermediateFormat", "orc")
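Alternatively, the same setting can be passed on the write itself (a sketch, assuming your connector version supports the intermediateFormat write option; keep your existing table/bucket options as they are):

finalDf.write
  .mode("append")
  .format("bigquery")
  .option("intermediateFormat", "orc")
  // plus your existing options, e.g. the target table and temporaryGcsBucket
  .save()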
Related
I am trying to come up with a schema definition to parse information out of a DataFrame string column; I am using from_json for that. I need help defining the schema, which I am somehow not getting right.
Here is the JSON I have:
[
{
"sectionid":"838096e332d4419191877a3fd40ed1f4",
"sequence":0,
"questions":[
{
"xid":"urn:com.mheducation.openlearning:lms.assessment.author:qastg.global:assessment_item:2a0f52fb93954f4590ac88d90888be7b",
"questionid":"d36e1d7eeeae459c8db75c7d2dfd6ac6",
"quizquestionid":"d36e1d7eeeae459c8db75c7d2dfd6ac6",
"qtype":"3",
"sequence":0,
"subsectionsequence":-1,
"type":"80",
"question":"<p>This is a simple, 1 question assessment for automation testing</p>",
"totalpoints":"5.0",
"scoring":"1",
"scoringrules":"{\"type\":\"perfect\",\"points\":5.0,\"pointsEach\":null,\"rules\":[]}",
"inputoption":"0",
"casesensitive":"0",
"suggestedscoring":"1",
"suggestedscoringrules":"{\"type\":\"perfect\",\"points\":5.0,\"pointsEach\":null,\"rules\":[]}",
"answers":[
"1"
],
"options":[
]
}
]
}
]
I want to parse this information out, which will result in the columns:
sectionid, sequence, xid, question.sequence, question.question (question text), answers
Here is what I have. I have defined a schema for testing like this:
import org.apache.spark.sql.types.{StringType, ArrayType, StructType, StructField}
val schema = new StructType()
.add("sectionid", StringType, true)
.add("sequence", StringType, true)
.add("questions", StringType, true)
.add("answers", StringType, true)
finalDF = finalDF
.withColumn( "parsed", from_json(col("enriched_payload.transformed"),schema) )
But I am getting NULL in the result columns; the reason, I think, is that my schema is not right.
I am struggling to come up with the right definition. How do I come up with the correct JSON schema definition?
I am using Spark 3.0.
Try the code below.
import org.apache.spark.sql.types._
val schema = ArrayType(
new StructType()
.add("sectionid",StringType,true)
.add("sequence",LongType,true)
.add("questions", ArrayType(
new StructType()
.add("answers",ArrayType(StringType,true),true)
.add("casesensitive",StringType,true)
.add("inputoption",StringType,true)
.add("options",ArrayType(StringType,true),true)
.add("qtype",StringType,true)
.add("question",StringType,true)
.add("questionid",StringType,true)
.add("quizquestionid",StringType,true)
.add("scoring",StringType,true)
.add("scoringrules",StringType,true)
.add("sequence",LongType,true)
.add("subsectionsequence",LongType,true)
.add("suggestedscoring",StringType,true)
.add("suggestedscoringrules",StringType,true)
.add("totalpoints",StringType,true)
.add("type",StringType,true)
.add("xid",StringType,true)
)
)
)
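A hedged usage sketch (the column name enriched_payload.transformed is taken from the question): parse the string with from_json using the schema above, then explode the two arrays to project the requested columns.

import org.apache.spark.sql.functions.{col, explode, from_json}

val parsedDF = finalDF
  .withColumn("parsed", from_json(col("enriched_payload.transformed"), schema))
  .withColumn("section", explode(col("parsed")))
  .withColumn("q", explode(col("section.questions")))
  .select(
    col("section.sectionid"),
    col("section.sequence"),
    col("q.xid"),
    col("q.sequence").as("question_sequence"),
    col("q.question"),
    col("q.answers"))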
So I have data streaming in like the following:
{
"messageDetails":{
"id": "1",
"name": "2"
},
"messageMain":{
"date": "string",
"details": [{"val1":"abcd","val2":"efgh"},{"val1":"aaaa","val2":"bbbb"}]
}
}
Here is an example message. Normally, I would define a schema like the following:
val tableSchema: StructType = (new StructType)
.add("messageDetails", (new StructType)
.add("id", StringType)
.add("name", StringType))
.add("messageMain", (new StructType)
.add("date", StringType)
.add("details", ???) ????)
Then read in the messages like so -
val df = spark.read.schema(tableSchema).json(rdd)
However, I am not sure how to define details, as it's a list of objects and not a StructType. I do not want to simply explode the rows if there is another way, because the end goal is to write this back to a Google BigQuery table that sets details as a REPEATED RECORD type.
You want an ArrayType of a StructType holding val1 and val2 as StringTypes,
e.g.
val itemSchema = (new StructType)
.add("val1", StringType)
.add("val2", StringType)
val detailsSchema = new ArrayType(itemSchema, false)
val tableSchema: StructType = (new StructType)
.add("messageDetails", (new StructType)
.add("id", StringType)
.add("name", StringType))
.add("messageMain", (new StructType)
.add("date", StringType)
.add("details", detailsSchema))
So, the thing is, since my schema could depend on the Kafka header/key, I want to apply the schema at the message level rather than at the DataFrame level. How can I achieve this? Thanks.
The code snippet to apply a schema at the DataFrame level is:
val ParsedDataFrame = kafkaStreamData.selectExpr("CAST(value AS STRING)", "CAST(key AS STRING)")
.select(from_json(col("value"), Schema), col("key"))
.select("value.*","key")
I want something like,
if(key == 'a'){
use Schema1
}
else{
use Schema2
}
P.S.: I tried using the foreach and map functions but neither worked; maybe I'm not using them correctly.
It is not possible to apply a different schema per row within the same column, as you would end up getting an AnalysisException due to a data type mismatch.
To test this you can do the following experiment.
Have the following data in a Kafka topic in the format key:::value:
a:::{"a":"foo","b":"bar"}
b:::{"a":"foo","b":"bar"}
In your streaming query you define:
val schemaA = new StructType().add("a", StringType)
val schemaB = new StructType().add("b", StringType)
val df = spark.readStream
.format("kafka")
.[...]
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.withColumn("parsedJson",
when(col("key") === "a", from_json(col("value"), schemaA))
.otherwise(from_json(col("value"), schemaB)))
This will result in an AnalysisException:
org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`key` = 'a') THEN jsontostructs(`value`) ELSE jsontostructs(`value`) END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;;
'Project [key#21, value#22, CASE WHEN (key#21 = a) THEN jsontostructs(StructField(a,StringType,true), value#22) ELSE jsontostructs(StructField(b,StringType,true), value#22) END AS parsedJson#27]
As #OneCricketeer mentioned in the comments, you need to separate the Kafka input stream into several DataFrames based on a filter first, and then apply a different schema when parsing the JSON string column of each.
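Here is a sketch of that approach, reusing the variable names from the question (kafkaStreamData) and the two schemas defined above; each key gets its own DataFrame with its own schema, and each can then be written out as a separate streaming query:

import org.apache.spark.sql.functions.{col, from_json}

val base = kafkaStreamData.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

// Parse rows with key "a" using schemaA.
val parsedA = base.filter(col("key") === "a")
  .select(from_json(col("value"), schemaA).as("value"), col("key"))
  .select("value.*", "key")

// Parse rows with key "b" using schemaB.
val parsedB = base.filter(col("key") === "b")
  .select(from_json(col("value"), schemaB).as("value"), col("key"))
  .select("value.*", "key")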
I had a list of columns; using these columns, I prepared a schema.
Code:
import org.apache.spark.sql.types._
val fields = Array("col1", "col2", "col3", "col4", "col5", "col6")
val dynSchema = StructType( fields.map( field =>
new StructField(field, StringType, true, null) ) )
Then the schema is prepared as:
StructType(StructField(col1,StringType,true), StructField(col2,StringType,true),
StructField(col3,StringType,true), StructField(col4,StringType,true),
StructField(col5,StringType,true), StructField(col6,StringType,true))
But I am getting a NullPointerException when I try to read the data from a JSON file using the above schema.
// reading the data
spark.read.schema(dynSchema).json("./file/path/*.json")
But it works if I pass an Array of StructFields to StructType directly (see the edit below).
Please help me generate this schema dynamically.
Edit: If I create the schema with the above fields as follows, I am able to read the data from JSON.
StructType(Array(
StructField("col1",StringType,true), StructField("col2",StringType,true),
StructField("col3",StringType,true), StructField("col4",StringType,true),
StructField("col5",StringType,true), StructField("col6",StringType,true)))
Simply remove the null argument from the creation of the StructField as follows:
val dynSchema = StructType( fields.map( field =>
new StructField(field, StringType, true)))
The last argument is used to define metadata about the column. Its default value is not null but Metadata.empty. See the source code for more detail. In the source code, they assume that it cannot be null and call methods on it without any checks. This is why you get a NullPointerException.
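Equivalently, you can pass the metadata argument explicitly (a small sketch; Metadata.empty is the default the constructor expects):

import org.apache.spark.sql.types._

val dynSchema = StructType(fields.map(field =>
  StructField(field, StringType, nullable = true, Metadata.empty)))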
I have an input file which looks much like CSV but with a custom header:
FIELDS-START
field1
field2
field3
FIELDS-END
DATA-START
val1,2,3
val2,4,5
DATA-END
Task:
Read the data into a typed DataFrame; the schema is obtained dynamically. Example for this specific file:
val schema = StructType(
StructField("field1", StringType, true) ::
StructField("field2", IntegerType, true) ::
StructField("field3", IntegerType, true) :: Nil
)
So because of the custom header I can't use the Spark CSV reader. Another thing I tried:
val file = spark.sparkContext.textFile(...)
val data: RDD[List[String]] = file.filter(_.contains(",")).map(_.split(',').toList)
val df: DataFrame = spark.sqlContext.createDataFrame(data.map(Row.fromSeq(_)), schema)
It fails with a runtime exception:
java.lang.String is not a valid external type for schema of int
which is because createDataFrame doesn't do any casting.
NOTE: Schema is obtained at runtime
Thanks in advance!
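One possible approach (a hedged sketch, not an answer from the original thread): cast each raw string to the type declared in the runtime schema before building the Rows, since createDataFrame performs no coercion.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Cast each string according to the corresponding StructField's dataType.
val typedRows = data.map { values =>
  Row.fromSeq(values.zip(schema.fields).map { case (value, field) =>
    field.dataType match {
      case IntegerType => value.trim.toInt
      case LongType    => value.trim.toLong
      case DoubleType  => value.trim.toDouble
      case _           => value
    }
  })
}

val typedDf = spark.sqlContext.createDataFrame(typedRows, schema)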