Applying schema at message level instead of dataframe level in Spark Structured Streaming - scala

So, the thing is: since my schema could depend on the Kafka header/key, I want to apply the schema at the message level rather than at the dataframe level. How can I achieve this? Thanks.
The code snippet that applies a schema at the dataframe level is:
val ParsedDataFrame = kafkaStreamData.selectExpr("CAST(value AS STRING)", "CAST(key AS STRING)")
.select(from_json(col("value"), Schema), col("key"))
.select("value.*","key")
I want something like,
if(key == 'a'){
use Schema1
}
else{
use Schema2
}
P.S.: I tried using the foreach and map functions but neither worked; maybe I am not using them correctly.

It is not possible to apply a different schema per row within the same column, as you would end up getting an AnalysisException due to a data type mismatch.
To test this you can do the following experiment.
Have the following data in the Kafka topic, in the format key:::value:
a:::{"a":"foo","b":"bar"}
b:::{"a":"foo","b":"bar"}
In your streaming query you define:
val schemaA = new StructType().add("a", StringType)
val schemaB = new StructType().add("b", StringType)
val df = spark.readStream
.format("kafka")
.[...]
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.withColumn("parsedJson",
when(col("key") === "a", from_json(col("value"), schemaA))
.otherwise(from_json(col("value"), schemaB)))
This will result in an AnalysisException:
org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`key` = 'a') THEN jsontostructs(`value`) ELSE jsontostructs(`value`) END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;;
'Project [key#21, value#22, CASE WHEN (key#21 = a) THEN jsontostructs(StructField(a,StringType,true), value#22) ELSE jsontostructs(StructField(b,StringType,true), value#22) END AS parsedJson#27]
As @OneCricketeer mentioned in the comments, you need to first separate the Kafka input stream into several DataFrames based on a filter, and then apply a different schema to each when parsing the JSON string column, as in the sketch below.
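A minimal sketch of that filter-and-split approach, reusing the schemaA/schemaB definitions and the key:::value data from the experiment above; the console sink and checkpoint locations are placeholders:
import org.apache.spark.sql.functions._

// Branch the stream by key, then parse each branch with its own schema.
val dfA = df.filter(col("key") === "a")
  .withColumn("parsed", from_json(col("value"), schemaA))
  .select(col("key"), col("parsed.*"))

val dfB = df.filter(col("key") === "b")
  .withColumn("parsed", from_json(col("value"), schemaB))
  .select(col("key"), col("parsed.*"))

// Each branch becomes its own streaming query.
dfA.writeStream.format("console").option("checkpointLocation", "/tmp/chk-a").start()
dfB.writeStream.format("console").option("checkpointLocation", "/tmp/chk-b").start()
spark.streams.awaitAnyTermination()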

Related

Extracting data from spark streaming dataframe

I'm new to Spark Streaming.
I'm getting an event similar to the one below from Kafka. I have to extract the path from the dataframe, read the data from that path, and write it to a destination.
{"path":["/tmp/file_path/file.parquet"],"format":"parquet","entries":null}
Any idea how to extract the path and format from the Spark streaming dataframe?
Here's what I'm trying to achieve:
val df: DataFrame = spark.readStream.format("kafka").
option("kafka.bootstrap.servers", "localhost:9092").
option("subscribe", "kafka-test-event").
option("startingOffsets", "earliest").load()
df.printSchema()
val valDf = df.selectExpr("CAST(value AS STRING)")
val path = valDf.collect()(0).getString(0)
println("path - "+ path)
val newDf = spark.read.parquet(path)
newDf.selectExpr("CAST(value AS STRING)").writeStream
.format("console")
.outputMode("append")
.start()
.awaitTermination()
Error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
kafka
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
When I try to do a collect on the dataframe it throws an Unsupported operation exception.
You are getting that error because you are trying to run batch (static) operations on a streaming dataframe. After you read the streaming dataframe from Kafka, you can try something like the following.
Create a schema for parsing the incoming data:
val incomingSchema = new StructType()
.add("path", StringType)
.add("format", StringType)
.add("entries", StringType)
Apply that schema to the incoming data; then you can select the required fields from your data and run transformations on top of them:
val valDf = df.selectExpr("CAST(value AS STRING) as jsonEntry")
  .select(from_json($"jsonEntry", incomingSchema).as("entry"))
  .select("entry.path")
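To then actually read the parquet files referenced in each message and write them out (the original goal), one option, not part of the answer above but a common pattern, is foreachBatch, which hands you a static DataFrame per micro-batch so collect() is allowed there. This sketch assumes path is parsed as an array of strings (matching the sample event); the destination path is a placeholder:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._

// Assumed schema for the sample event, with path as an array of strings.
val eventSchema = new StructType()
  .add("path", ArrayType(StringType))
  .add("format", StringType)
  .add("entries", StringType)

val pathsDf = df.selectExpr("CAST(value AS STRING) as jsonEntry")
  .select(from_json(col("jsonEntry"), eventSchema).as("entry"))
  .select(explode(col("entry.path")).as("path"))

pathsDf.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    // Within foreachBatch the micro-batch is a static DataFrame, so collect() works here.
    batchDf.collect().map(_.getString(0)).foreach { path =>
      spark.read.parquet(path)
        .write.mode("append").parquet("/tmp/destination") // placeholder destination
    }
  }
  .start()
  .awaitTermination()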

How to Read data from kafka topic with different schema (has some optional objects) in structured streaming

I have data coming into a Kafka topic which has an optional object, and since it's optional I am missing those records when reading with a defined schema.
For example, the schema I have:
val schema = new StructType()
  .add("obj_1", new ArrayType(
    new StructType(Array(
      StructField("field1", StringType),
      StructField("field2", StringType),
      StructField("obj_2", new ArrayType(
        new StructType(Array(
          StructField("field3", StringType),
          StructField("field4", LongType),
          StructField("obj_3", new ArrayType(
            new StructType(Array(
              StructField("field5", StringType),
              StructField("field6", StringType)
            )), containsNull = true))
        )), containsNull = true)),
      StructField("field7", StringType),
      StructField("field8", StringType)
    )), containsNull = true))
When publishing data to this topic we sometimes will not send obj_3, based on some conditions.
So when reading the topic and mapping it to the above schema, we are missing the records that do not have obj_3 and only get the records where obj_3 is present.
How can I read the data that sometimes does not have obj_3?
Sample code:
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "bootstrap.servers")
  .option("subscribe", "topic.name")
  .option("startingOffsets", "offset.reset")
  .option("failOnDataLoss", "true")
  .load()
val cast = df.selectExpr("CAST(value AS STRING)")
  .as[String]
val resultedDf = cast.select(from_json(col("value"), schema).as("newDF"))
val finalDf = resultedDf.select(col("newDF.*"))
You could either:
Use a flag (e.g. called "obj3flag" within a JSON structure) in the key of the Kafka message that tells your Structured Streaming job whether obj_3 exists in the Kafka value, and then choose one schema or the other to parse the JSON string. Something like:
import org.apache.spark.sql.functions._
val schemaWithObj3 = ...
val schemaWithOutObj3 = ...
val cast = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
val resultedDf = cast
  .withColumn("obj_3_flag", get_json_object(col("key"), "$.obj3flag"))
  .withColumn("data",
    when(col("obj_3_flag") === lit(1), from_json(col("value"), schemaWithObj3))
      .otherwise(from_json(col("value"), schemaWithOutObj3)))
  .select(col("data.*"))
Or do a string search on "obj_3" in the Kafka value (cast as a string) and, if the string is found, apply one schema or the other to parse the JSON. The code will look very similar to the one for the other option (see the sketch below).
Please note that I wrote the code on my mobile, so you may find some syntax issues. However, I hope the idea gets across.
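A minimal sketch of that second option. To avoid mixing two different struct types in a single when/otherwise expression (which, as the first answer on this page shows, raises a data-type-mismatch AnalysisException), it splits the stream on a raw string search and parses each branch with its own schema, reusing the cast Dataset and the two schema placeholders above:
import org.apache.spark.sql.functions._

// Records whose raw JSON mentions "obj_3" get the richer schema.
val withObj3Df = cast
  .filter(col("value").contains("obj_3"))
  .select(from_json(col("value"), schemaWithObj3).as("data"))
  .select(col("data.*"))

// Everything else is parsed without obj_3.
val withoutObj3Df = cast
  .filter(!col("value").contains("obj_3"))
  .select(from_json(col("value"), schemaWithOutObj3).as("data"))
  .select(col("data.*"))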

How to apply Spark schema to the query based on Kafka topic name in Spark Structured Streaming?

I have a Spark Structured Streaming job which is streaming data from multiple Kafka topics based on a subscribePattern and for every Kafka topic I have a Spark schema. When streaming the data from Kafka I want to apply the Spark schema to the Kafka message based on the topic name.
Consider I have two topics: cust & customers.
Streaming data from Kafka based on subscribePattern (Java regex string):
var df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "cust*")
.option("startingOffsets", "earliest")
.load()
.withColumn("value", $"value".cast("string"))
.filter($"value".isNotNull)
The above streaming query streams data from both topics.
Let's say I have two Spark schemas one for each topic:
var cust: StructType = new StructType()
.add("name", StringType)
.add("age", IntegerType)
var customers: StructType = new StructType()
.add("id", IntegerType)
.add("first_name", StringType)
.add("last_name", StringType)
.add("email", StringType)
.add("address", StringType)
Now, I want to apply the Spark schema based on the topic name. To do that, I have written a udf which reads the topic name and returns the schema in DDL format:
val schema = udf((table: String) => (table) match {
case ("cust") => cust.toDDL
case ("customers") => customers.toDDL
case _ => new StructType().toDDL
})
Then I am using the udf (I understand that udf applies on every column) inside the from_json method like this:
val query = df
.withColumn("topic", $"topic".cast("string"))
.withColumn("data", from_json($"value", schema($"topic")))
.select($"key", $"topic", $"data.*")
.writeStream.outputMode("append")
.format("console")
.start()
.awaitTermination()
This gives me the following exception, which makes sense because from_json expects the schema as a string literal in DDL format (or a StructType), not a column produced by a UDF:
org.apache.spark.sql.AnalysisException: Schema should be specified in DDL format as a string literal or output of the schema_of_json function instead of UDF(topic);
How can I accomplish this?
Any help will be appreciated!
What you're doing is not possible: your query on df can't produce two different schemas for the same column.
I can think of 2 ways to do it:
Split your df by topic, then apply the two schemas to the two resulting DataFrames (cust and customers), as in the sketch below.
Merge the 2 schemas into 1 schema and apply that to all topics.
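A minimal sketch of the first option, reusing the df, cust, and customers definitions above; the console sink is a placeholder:
import spark.implicits._
import org.apache.spark.sql.functions.from_json

// Branch the stream by topic and parse each branch with its own schema.
val custQuery = df
  .filter($"topic" === "cust")
  .withColumn("data", from_json($"value", cust))
  .select($"key", $"topic", $"data.*")
  .writeStream.outputMode("append").format("console").start()

val customersQuery = df
  .filter($"topic" === "customers")
  .withColumn("data", from_json($"value", customers))
  .select($"key", $"topic", $"data.*")
  .writeStream.outputMode("append").format("console").start()

// Keep both queries running.
spark.streams.awaitAnyTermination()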

How to write file to cassandra from spark

I am new to Spark and Cassandra. I use this code but it gives me an error.
val dfprev = df.select(col = "se","hu")
val a = dfprev.select("se")
val b = dfprev.select("hu")
val collection = sc.parallelize(Seq(a,b))
collection.saveToCassandra("keyspace", "table", SomeColumns("se","hu"))
When I run this code, it gives me an error at saveToCassandra. The error is:
java.lang.IllegalArgumentException: Multiple constructors with the same number of parameters not allowed.
at com.datastax.spark.connector.util.Reflect$.methodSymbol(Reflect.scala:16)
at com.datastax.spark.connector.util.ReflectionUtil$.constructorParams(ReflectionUtil.scala:63)
at com.datastax.spark.connector.mapper.DefaultColumnMapper.<init>(DefaultColumnMapper.scala:45)
at com.datastax.spark.connector.mapper.LowPriorityColumnMapper$class.defaultColumnMapper(ColumnMapper.scala:51)
at com.datastax.spark.connector.mapper.ColumnMapper$.defaultColumnMapper(ColumnMapper.scala:55)
Instead of going through sc.parallelize, write the DataFrame directly using the Spark Cassandra data source:
val dfprev = df.select("se","hu")
dfprev.write.format("org.apache.spark.sql.cassandra")
.options(Map("keyspace"->"YOUR_KEYSPACE_NAME","table"->"YOUR_TABLE_NAME"))
.mode(SaveMode.Append)
.save()
The variables a and b are DataFrames. sc.parallelize creates an RDD from a collection of elements; it does not accept a DataFrame as input.
Note: set spark.cassandra.connection.host, and spark.cassandra.auth.username and spark.cassandra.auth.password (if authentication is enabled), in the Spark configuration, as in the sketch below.
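A minimal sketch of that configuration via the SparkSession builder; the host and credentials are placeholders:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("write-to-cassandra")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  // Only needed if authentication is enabled on the cluster.
  .config("spark.cassandra.auth.username", "cassandra")
  .config("spark.cassandra.auth.password", "cassandra")
  .getOrCreate()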

How to read records in JSON format from Kafka using Structured Streaming?

I am trying to use the Structured Streaming approach, based on the DataFrame/Dataset API, to load a stream of data from Kafka.
I use:
Spark 2.10
Kafka 0.10
spark-sql-kafka-0-10
Spark Kafka DataSource has defined underlying schema:
|key|value|topic|partition|offset|timestamp|timestampType|
My data comes in JSON format and is stored in the value column. I am looking for a way to extract the underlying schema from the value column and expand the received dataframe into the columns stored in value. I tried the approach below but it does not work:
val columns = Array("column1", "column2") // column names
val rawKafkaDF = sparkSession.sqlContext.readStream
.format("kafka")
.option("kafka.bootstrap.servers","localhost:9092")
.option("subscribe",topic)
.load()
val columnsToSelect = columns.map( x => new Column("value." + x))
val kafkaDF = rawKafkaDF.select(columnsToSelect:_*)
// some analytics using stream dataframe kafkaDF
val query = kafkaDF.writeStream.format("console").start()
query.awaitTermination()
Here I am getting the exception org.apache.spark.sql.AnalysisException: Can't extract value from value#337; because at the time the stream is created, the values inside are not known...
Do you have any suggestions?
From the Spark perspective value is just a sequence of bytes. Spark has no knowledge of the serialization format or content. To be able to extract the field you have to parse it first.
If the data is serialized as a JSON string you have two options. You can cast value to StringType and use from_json with a provided schema:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.from_json
val schema: StructType = StructType(Seq(
StructField("column1", ???),
StructField("column2", ???)
))
rawKafkaDF.select(from_json($"value".cast(StringType), schema))
or cast to StringType, extract fields by path using get_json_object:
import org.apache.spark.sql.functions.get_json_object
val columns: Seq[String] = ???
val exprs = columns.map(c => get_json_object($"value".cast("string"), s"$$.$c"))
rawKafkaDF.select(exprs: _*)
and cast later to the desired types.
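For completeness, a minimal end-to-end sketch of the from_json route, with assumed field types (column1 as a string, column2 as an integer) and a console sink:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.from_json
import sparkSession.implicits._

// Assumed field types for illustration only.
val schema = StructType(Seq(
  StructField("column1", StringType),
  StructField("column2", IntegerType)
))

val kafkaDF = rawKafkaDF
  .select(from_json($"value".cast(StringType), schema).as("data"))
  .select($"data.column1", $"data.column2")

kafkaDF.writeStream.format("console").start().awaitTermination()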