How to read data from a Kafka topic with a varying schema (some optional objects) in Structured Streaming - Scala

I have data coming into a Kafka topic which contains an optional object, and since it is optional I am missing those records when reading with a defined schema.
Example: the schema I have:
val schema = new StructType()
  .add("obj_1", new ArrayType(
    new StructType(Array(
      StructField("field1", StringType),
      StructField("field2", StringType),
      StructField("obj_2", new ArrayType(
        new StructType(Array(
          StructField("field3", StringType),
          StructField("field4", LongType),
          StructField("obj_3", new ArrayType(
            new StructType(Array(
              StructField("field5", StringType),
              StructField("field6", StringType)
            )), containsNull = true))
        )), containsNull = true)),
      StructField("field7", StringType),
      StructField("field8", StringType)
    )), containsNull = true))
When publishing data to this topic we sometimes do not send obj_3, depending on certain conditions.
So when reading the topic and mapping it to the above schema, we are missing the records that do not contain obj_3 and only getting the records where obj_3 is present.
How can I read the data when obj_3 is sometimes absent?
Sample code:
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "bootstrap.servers")
  .option("subscribe", "topic.name")
  .option("startingOffsets", "offset.reset")
  .option("failOnDataLoss", "true")
  .load()

val cast = df.selectExpr("CAST(value AS STRING)")
  .as[String]

val resultedDf = cast.select(from_json(col("value"), schema).as("newDF"))
val finalDf = resultedDf.select(col("newDF.*"))

You could either use a flag (e.g. called "obj3flag" within a JSON structure) in the key of the Kafka message that tells your Structured Streaming job whether obj_3 exists in the Kafka value, and then pick one schema or the other to parse the JSON string. Something like:
import org.apache.spark.sql.functions._

val schemaWithObj3 = ...
val schemaWithOutObj3 = ...

val cast = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

val resultedDf = cast
  .withColumn("obj_3_flag", get_json_object(col("key"), "$.obj3flag"))
  .withColumn("data", when(col("obj_3_flag") === lit(1), from_json(col("value"), schemaWithObj3))
    .otherwise(from_json(col("value"), schemaWithOutObj3)))
  .select(col("data.*"))
Alternatively, you could do a string search for "obj_3" in the Kafka value (cast as a string) and, depending on whether the string is found, apply one schema or the other to parse the JSON. The code will look very similar to the one for the first option.
Please note that I have written the code on my mobile, so you may find some syntax issues. However, I hope the idea gets across.
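For what it's worth, here is a minimal sketch of the string-search option. It splits the stream with a filter so that each branch is parsed with a single schema (reusing the schemaWithObj3 / schemaWithOutObj3 placeholders from above; the stringDf, withObj3 and withoutObj3 names are just for illustration):
import org.apache.spark.sql.functions._

val stringDf = df.selectExpr("CAST(value AS STRING)")

// branch where obj_3 appears somewhere in the payload
val withObj3 = stringDf
  .filter(col("value").contains("obj_3"))
  .select(from_json(col("value"), schemaWithObj3).as("data"))
  .select(col("data.*"))

// branch without obj_3
val withoutObj3 = stringDf
  .filter(!col("value").contains("obj_3"))
  .select(from_json(col("value"), schemaWithOutObj3).as("data"))
  .select(col("data.*"))
Splitting into two DataFrames keeps each parsed column a single struct type, and each branch can then be written out by its own streaming query.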

Related

Spark Structured Streaming: StructField(..., ..., False) always returns `nullable=true` instead of `nullable=false`

I'm using Spark Structured Streaming (3.2.1) with Kafka.
I'm trying to simply read JSON from Kafka using a defined schema.
My problem is that the defined schema has a non-nullable field which is ignored when I read messages from Kafka. I use the from_json function, which seems to ignore that some fields can't be null.
Here is my code example:
val schemaTest = new StructType()
  .add("firstName", StringType)
  .add("lastName", StringType)
  .add("birthDate", LongType, nullable = false)

val loader = spark
  .readStream
  .format("kafka")
  .option("startingOffsets", "earliest")
  .option("kafka.bootstrap.servers", "BROKER:PORT")
  .option("subscribe", "TOPIC")
  .load()

val df = loader
  .selectExpr("CAST(value AS STRING)")
  .withColumn("value", from_json(col("value"), schemaTest))
  .select(col("value.*"))

df.printSchema()

val q = df.writeStream
  .format("console")
  .option("truncate", "false")
  .start()
  .awaitTermination()
This is what I get when printing the schema of df, which is different from my schemaTest:
root
|-- firstName: string (nullable = true)
|-- lastName: string (nullable = true)
|-- birthDate: long (nullable = true)
And the received data looks like this:
+---------+--------+----------+
|firstName|lastName|birthDate |
+---------+--------+----------+
|Toto     |Titi    |1643799912|
|Tutu     |Tata    |null      |
+---------+--------+----------+
We also tried changing the mode option of the from_json function from the default PERMISSIVE to the others (DROPMALFORMED, FAILFAST), but the second record, which doesn't respect the defined schema, is simply not considered corrupted because the field birthDate is treated as nullable.
Maybe I missed something, but if not, I have the following questions.
Do you know why the printSchema of df is not like my schemaTest (i.e. with the non-nullable field)?
Also, how can I handle non-nullable values in my case? I know that I can filter, but I would like to know if there is an alternative using the schema, the way it is supposed to work. It is also not that simple to filter if my schema has lots of non-nullable fields.
This is actually the intended behavior of the from_json function. You can read the following in the source code:
// The JSON input data might be missing certain fields. We force the nullability
// of the user-provided schema to avoid data corruptions. In particular, the parquet-mr encoder
// can generate incorrect files if values are missing in columns declared as non-nullable.
val nullableSchema = schema.asNullable
override def nullable: Boolean = true
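The relaxation is easy to see even in a small batch example (a minimal sketch, assuming a SparkSession named spark with spark.implicits._ imported):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val schemaTest = new StructType()
  .add("firstName", StringType)
  .add("lastName", StringType)
  .add("birthDate", LongType, nullable = false)

// parse a record that is missing the supposedly non-nullable birthDate
val parsed = Seq("""{"firstName":"Tutu","lastName":"Tata"}""").toDF("value")
  .select(from_json(col("value"), schemaTest).as("value"))
  .select(col("value.*"))

parsed.printSchema() // birthDate comes back as nullable = true
parsed.show()        // the row is kept, birthDate is just null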
If you have multiple fields which are mandatory then you can construct the filter expression from your schemaTest (or list of columns) and use it like this:
val filterExpr = schemaTest.fields
  .filter(!_.nullable)
  .map(f => col(f.name).isNotNull)
  .reduce(_ and _)

val df = loader
  .selectExpr("CAST(value AS STRING)")
  .withColumn("value", from_json(col("value"), schemaTest))
  .select(col("value.*"))
  .filter(filterExpr)
I would like to propose a different way of doing it:
def isCorrupted(df: DataFrame): DataFrame = {
  val nonNullableColumns = schemaTest
    .filter(field => !field.nullable)
    .map(_.name)

  nonNullableColumns
    .foldLeft(df.withColumn("isCorrupted", lit(0))) { (accumulator, columnName) =>
      // mark the row as corrupted if any non-nullable column is null
      accumulator.withColumn("isCorrupted",
        when(col(columnName).isNull, lit(1)).otherwise(col("isCorrupted")))
    }
    .filter(col("isCorrupted") === lit(0))
    .drop("isCorrupted")
}
val df = loader
  .selectExpr("CAST(value AS STRING)")
  .withColumn("value", from_json(col("value"), schemaTest))
  .select(col("value.*"))
  .transform(isCorrupted)

Applying schema at message level instead of dataframe level in Spark Structured Streaming

So, the thing is, since my schema could depend on the Kafka header/key, I want to apply the schema at message level rather than at dataframe level. How can I achieve this? Thanks.
The code snippet to apply a schema at dataframe level is:
val ParsedDataFrame = kafkaStreamData.selectExpr("CAST(value AS STRING)", "CAST(key AS STRING)")
  .select(from_json(col("value"), Schema), col("key"))
  .select("value.*", "key")
I want something like,
if (key == 'a') {
  use Schema1
} else {
  use Schema2
}
P.S.: I tried using the foreach and map functions, but neither worked; maybe I am not using them correctly.
It is not possible to apply a different schema per row within the same column, as you would end up getting an AnalysisException due to a data type mismatch.
To test this you can do the following experiment.
Have the following data in the Kafka topic in the format key:::value:
a:::{"a":"foo","b":"bar"}
b:::{"a":"foo","b":"bar"}
In your streaming query you define:
val schemaA = new StructType().add("a", StringType)
val schemaB = new StructType().add("b", StringType)

val df = spark.readStream
  .format("kafka")
  .[...]
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .withColumn("parsedJson",
    when(col("key") === "a", from_json(col("value"), schemaA))
      .otherwise(from_json(col("value"), schemaB)))
This will result in an AnalysisException:
org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`key` = 'a') THEN jsontostructs(`value`) ELSE jsontostructs(`value`) END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;;
'Project [key#21, value#22, CASE WHEN (key#21 = a) THEN jsontostructs(StructField(a,StringType,true), value#22) ELSE jsontostructs(StructField(b,StringType,true), value#22) END AS parsedJson#27]
As @OneCricketeer mentioned in the comments, you need to separate the Kafka input stream into several DataFrames based on a filter first, and then apply the different schemas when parsing the JSON string column.
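A minimal sketch of that split, assuming df here refers to the stream right after the two CASTs (i.e. without the failing withColumn), with org.apache.spark.sql.functions._ in scope and reusing schemaA and schemaB from above; the dfA/dfB names are just for illustration:
val dfA = df
  .filter(col("key") === "a")
  .withColumn("parsedJson", from_json(col("value"), schemaA))

val dfB = df
  .filter(col("key") =!= "a")
  .withColumn("parsedJson", from_json(col("value"), schemaB))

// each DataFrame now has a consistently typed parsedJson column and can be
// processed or written out by its own streaming query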

writeStream() is printing null values in the batch output even though I supply proper JSON data to the Kafka topic

I am trying to convert JSON using a schema and print the values to the console, but writeStream() is printing null values in all columns even though I provide proper data.
The data I am publishing to the Kafka topic:
{"stock":"SEE","buy":12,"sell":15,"profit":3,quantity:27,"loss":0,"gender":"M"}
{"stock":"SEE","buy":12,"sell":15,"profit":3,quantity:27,"loss":0,"gender":"M"}
{"stock":"SEE","buy":12,"sell":15,"profit":3,quantity:27,"loss":0,"gender":"M"}
Below is my Scala code:
val readStreamDFInd = sparkSession.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "IndiaStocks")
  .option("startingOffsets", "earliest")
  .load()
//readStreamDFInd.printSchema()

val readStreamDFUS = sparkSession.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "USStocks")
  .option("startingOffsets", "earliest")
  .load()

val schema = new StructType()
  .add("stock", StringType)
  .add("buy", IntegerType)
  .add("sell", IntegerType)
  .add("profit", IntegerType)
  .add("quantity", IntegerType)
  .add("loss", IntegerType)
  .add("gender", StringType)

val stocksIndia = readStreamDFInd.selectExpr("CAST(value AS STRING) as json")
  .select(from_json($"json", schema).as("data"))
  .select("data.*")

val stocksUSA = readStreamDFUS.selectExpr("CAST(value AS STRING) as json")
  .select(from_json($"json", schema).as("data"))
  .select("data.*")

stocksIndia.printSchema()

stocksUSA.writeStream
  .format("console")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()
  .awaitTermination()
The code works fine, as you can also see in the book.
Looking at the documentation of the from_json function, the null values are created because the string is unparseable.
=> You are missing the quotation marks around the quantity field in your JSON string.
The problem is in your Kafka data: the quantity field should be in quotes. Please check below.
{"stock":"SEE","buy":12,"sell":15,"profit":3,"quantity":27,"loss":0,"gender":"M"}
{"stock":"SEE","buy":12,"sell":15,"profit":3,"quantity":27,"loss":0,"gender":"M"}
{"stock":"SEE","buy":12,"sell":15,"profit":3,"quantity":27,"loss":0,"gender":"M"}

How to apply Spark schema to the query based on Kafka topic name in Spark Structured Streaming?

I have a Spark Structured Streaming job which is streaming data from multiple Kafka topics based on a subscribePattern and for every Kafka topic I have a Spark schema. When streaming the data from Kafka I want to apply the Spark schema to the Kafka message based on the topic name.
Consider I have two topics: cust & customers.
Streaming data from Kafka based on subscribePattern (Java regex string):
var df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "cust*")
  .option("startingOffsets", "earliest")
  .load()
  .withColumn("value", $"value".cast("string"))
  .filter($"value".isNotNull)
The above streaming query streams data from both topics.
Let's say I have two Spark schemas one for each topic:
var cust: StructType = new StructType()
  .add("name", StringType)
  .add("age", IntegerType)

var customers: StructType = new StructType()
  .add("id", IntegerType)
  .add("first_name", StringType)
  .add("last_name", StringType)
  .add("email", StringType)
  .add("address", StringType)
Now I want to apply the Spark schema based on the topic name, and to do that I have written a UDF which reads the topic name and returns the schema in DDL format:
val schema = udf((table: String) => table match {
  case "cust" => cust.toDDL
  case "customers" => customers.toDDL
  case _ => new StructType().toDDL
})
Then I am using the udf (I understand that udf applies on every column) inside the from_json method like this:
val query = df
  .withColumn("topic", $"topic".cast("string"))
  .withColumn("data", from_json($"value", schema($"topic")))
  .select($"key", $"topic", $"data.*")
  .writeStream.outputMode("append")
  .format("console")
  .start()
  .awaitTermination()
This gives me the following exception, which makes sense because from_json expects the schema as a string literal in DDL format or as a StructType:
org.apache.spark.sql.AnalysisException: Schema should be specified in DDL format as a string literal or output of the schema_of_json function instead of UDF(topic);
How can I accomplish this?
Any help will be appreciated!
What you're doing is not possible. Your query df can't have 2 different schemas.
I can think of 2 ways to do it:
Split your df by topic, then apply your 2 schemas to 2 DataFrames (cust and customers), as sketched below.
Merge the 2 schemas into 1 schema and apply that to all topics.
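A minimal sketch of the first option, reusing the df, cust and customers values from the question (the custDf/customersDf names are just for illustration):
val custDf = df
  .filter($"topic" === "cust")
  .withColumn("data", from_json($"value", cust))
  .select($"key", $"topic", $"data.*")

val customersDf = df
  .filter($"topic" === "customers")
  .withColumn("data", from_json($"value", customers))
  .select($"key", $"topic", $"data.*")

// each of these can be written out by its own writeStream query
For the second option, merging works here because the two schemas have no overlapping field names; every record is parsed with the merged struct, and the fields that are absent simply come back as null:
val merged = StructType(cust.fields ++ customers.fields)

val allDf = df
  .withColumn("data", from_json($"value", merged))
  .select($"key", $"topic", $"data.*")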

How to use from_json with Kafka connect 0.10 and Spark Structured Streaming?

I was trying to reproduce the example from Databricks and apply it to the new Kafka connector and Spark Structured Streaming; however, I cannot parse the JSON correctly using the out-of-the-box methods in Spark.
Note: the topic is written to Kafka in JSON format.
val ds1 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", IP + ":9092")
  .option("zookeeper.connect", IP + ":2181")
  .option("subscribe", TOPIC)
  .option("startingOffsets", "earliest")
  .option("max.poll.records", 10)
  .option("failOnDataLoss", false)
  .load()
The following code won't work; I believe that's because the column json is a string and does not match the from_json method signature:
val df = ds1.select($"value" cast "string" as "json")
.select(from_json("json") as "data")
.select("data.*")
Any tips?
[UPDATE] Example working:
https://github.com/katsou55/kafka-spark-structured-streaming-example/blob/master/src/main/scala-2.11/Main.scala
First you need to define the schema for your JSON message. For example:
val schema = new StructType()
  .add($"id".string)
  .add($"name".string)
Now you can use this schema in the from_json method like below.
val df = ds1.select($"value" cast "string" as "json")
  .select(from_json($"json", schema) as "data")
  .select("data.*")