How to apply a Spark schema to the query based on Kafka topic name in Spark Structured Streaming? (Scala)

I have a Spark Structured Streaming job that streams data from multiple Kafka topics based on a subscribePattern, and for every Kafka topic I have a Spark schema. When streaming the data from Kafka, I want to apply the matching Spark schema to each Kafka message based on its topic name.
Consider I have two topics: cust & customers.
Streaming data from Kafka based on subscribePattern (Java regex string):
var df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "cust*")
.option("startingOffsets", "earliest")
.load()
.withColumn("value", $"value".cast("string"))
.filter($"value".isNotNull)
The above streaming query streams data from both topics.
Let's say I have two Spark schemas one for each topic:
var cust: StructType = new StructType()
.add("name", StringType)
.add("age", IntegerType)
var customers: StructType = new StructType()
.add("id", IntegerType)
.add("first_name", StringType)
.add("last_name", StringType)
.add("email", StringType)
.add("address", StringType)
Now, I want to apply the Spark schema based on the topic name. To do that, I have written a UDF which takes the topic name and returns the schema in DDL format:
val schema = udf((table: String) => (table) match {
case ("cust") => cust.toDDL
case ("customers") => customers.toDDL
case _ => new StructType().toDDL
})
Then I am using the UDF (I understand that the UDF is applied to each value of the column) inside the from_json method like this:
val query = df
.withColumn("topic", $"topic".cast("string"))
.withColumn("data", from_json($"value", schema($"topic")))
.select($"key", $"topic", $"data.*")
.writeStream.outputMode("append")
.format("console")
.start()
.awaitTermination()
This gives me the following exception, which makes sense because from_json expects the schema either as a DDL-formatted string literal or as a StructType, not the output of a UDF.
org.apache.spark.sql.AnalysisException: Schema should be specified in DDL format as a string literal or output of the schema_of_json function instead of UDF(topic);
How can I accomplish this?
Any help is appreciated!

What you're doing is not possible: a single query on df can't apply two different schemas.
I can think of two ways to do it:
1. Split your df by topic, then apply each of the two schemas (cust and customers) to its own dataframe; see the sketch below.
2. Merge the two schemas into one schema and apply that to all topics.
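A minimal sketch of the first option, reusing df, cust, and customers from the question. The console sinks and output mode are placeholders for whatever sinks you actually use:
import org.apache.spark.sql.functions.from_json

// Parse messages from the cust topic with the cust schema
val custDf = df
  .filter($"topic" === "cust")
  .withColumn("data", from_json($"value", cust))
  .select($"key", $"topic", $"data.*")

// Parse messages from the customers topic with the customers schema
val customersDf = df
  .filter($"topic" === "customers")
  .withColumn("data", from_json($"value", customers))
  .select($"key", $"topic", $"data.*")

// Each branch runs as its own streaming query
custDf.writeStream.format("console").outputMode("append").start()
customersDf.writeStream.format("console").outputMode("append").start()
spark.streams.awaitAnyTermination()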

Related

Extracting data from a Spark streaming dataframe

I'm new to Spark Streaming.
I'm getting an event similar to the one below from Kafka. I have to extract the path from the dataframe, read the data from that path, and write it to a destination.
{"path":["/tmp/file_path/file.parquet"],"format":"parquet","entries":null}
Any idea how to extract the path and format from the Spark streaming dataframe?
Here's what I'm trying to achieve:
val df: DataFrame = spark.readStream.format("kafka").
option("kafka.bootstrap.servers", "localhost:9092").
option("subscribe", "kafka-test-event").
option("startingOffsets", "earliest").load()
df.printSchema()
val valDf = df.selectExpr("CAST(value AS STRING)")
val path = valDf.collect()(0).getString(0)
println("path - "+ path)
val newDf = spark.read.parquet(path)
newDf.selectExpr("CAST(value AS STRING)").writeStream
.format("console")
.outputMode("append")
.start()
.awaitTermination()
Error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
kafka
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
When I try to do a collect on the dataframe it throws an Unsupported operation exception.
You are getting that error because you are trying to run a batch operation (collect) on a streaming dataframe.
After you read the streaming dataframe from Kafka, you can try something like the below.
Create a schema for parsing the incoming data:
val incomingSchema = new StructType()
.add("path", StringType)
.add("format", StringType)
.add("entries", StringType)
Apply that schema to the incoming data; you can then select the required fields and do transformations on top of them.
val valDf = df.selectExpr("CAST(value AS STRING) AS jsonEntry")
  .select(from_json($"jsonEntry", incomingSchema).as("data"))
  .select("data.path")
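As a hedged follow-up (not part of the original answer; foreachBatch needs Spark 2.4+): once the path is extracted, you can act on it per micro-batch with foreachBatch, where each batch is a static dataframe and collect() is allowed. This sketch assumes "path" is declared as an array of strings to match the sample event (the schema above declares it as a plain string), and the destination and checkpoint paths are placeholders:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

// Assumption: "path" holds an array of file paths, as in the sample event
val pathArraySchema = new StructType()
  .add("path", ArrayType(StringType))
  .add("format", StringType)
  .add("entries", StringType)

val pathsDf = df
  .selectExpr("CAST(value AS STRING) AS jsonEntry")
  .select(from_json(col("jsonEntry"), pathArraySchema).as("data"))
  .select(explode(col("data.path")).as("path"))

pathsDf.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    // Inside foreachBatch each micro-batch is a static dataframe, so collect() is legal here
    batchDf.collect().foreach { row =>
      val parquetDf = spark.read.parquet(row.getString(0))
      parquetDf.write.mode("append").parquet("/some/destination/path") // placeholder destination
    }
  }
  .option("checkpointLocation", "/tmp/checkpoints/kafka-test-event") // placeholder checkpoint path
  .start()
  .awaitTermination()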

How to read data from a Kafka topic with a varying schema (has some optional objects) in Structured Streaming

I have data coming into a Kafka topic which has an optional object, and since it is optional I am missing those records when reading with a defined schema.
For example, the schema I have:
val schema = new StructType()
  .add("obj_1", new ArrayType(
    new StructType(Array(
      StructField("field1", StringType),
      StructField("field2", StringType),
      StructField("obj_2", new ArrayType(
        new StructType(Array(
          StructField("field3", StringType),
          StructField("field4", LongType),
          StructField("obj_3", new ArrayType(
            new StructType(Array(
              StructField("field5", StringType),
              StructField("field6", StringType)
            )),
            containsNull = true
          ))
        )),
        containsNull = true
      )),
      StructField("field7", StringType),
      StructField("field8", StringType)
    )),
    containsNull = true
  ))
When publishing data to this topic we sometimes do not send obj_3, based on some conditions.
So when reading the topic and mapping it to the above schema, we are missing the records that do not have obj_3 and only get the ones where obj_3 is present.
How can I read the data that sometimes does not have obj_3?
Sample code:
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers","bootstrap.servers")
.option("subscribe", "topic.name")
.option("startingOffsets", "offset.reset)
.option("failOnDataLoss","true")
.load()
val cast = df.selectExpr("CAST(value AS STRING)")
.as[String]
val resultedDf = cast.select(from_json(col("value"), schema).as("newDF"))
val finalDf = resultedDf.select(col("newDF.*"))
You could either
use a flag (e.g. called "obj3flag" within a JSON structure) in the key of the Kafka message that tells your Structured Streaming job whether obj_3 exists in the Kafka value, and then choose one or the other schema to parse the JSON string. Something like:
import org.apache.spark.sql.functions._
val schemaWithObj3 = ...
val schemaWithOutObj3 = ...
val cast = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
val resultedDf = cast
.withColumn("obj_3_flag", get_json_object(col("key"), "$.obj3flag"))
.withColumn("data", when(col("obj3_flag") === lit(1), from_json(col("value"), schemaWithObj3).otherwise(from_json(col("value"), schemaWithOutObj3)))
.select(col("data.*"))
do a string search for "obj_3" in the Kafka value (cast as a string) and, if the string is found, apply one or the other schema to parse the JSON. The code will look very similar to the one for the other option; a variant is sketched below.
Please note that I have written the code on my mobile, so you may find some syntax issues. However, I hope the idea gets across.
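A rough sketch of the second option as a variant (mine, not taken verbatim from the answer): split the casted stream on the string match and parse each branch with its own schema, reusing cast, schemaWithObj3 and schemaWithOutObj3 from above. Splitting keeps the two different struct types in separate dataframes instead of mixing them in a single column:
import org.apache.spark.sql.functions.{col, from_json}

// Branch for records that contain obj_3
val withObj3Df = cast
  .filter(col("value").contains("obj_3"))
  .select(from_json(col("value"), schemaWithObj3).as("data"))
  .select(col("data.*"))

// Branch for records without obj_3
val withoutObj3Df = cast
  .filter(!col("value").contains("obj_3"))
  .select(from_json(col("value"), schemaWithOutObj3).as("data"))
  .select(col("data.*"))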

Applying schema at message level instead of dataframe level in Spark Structured Streaming

So, the thing is, since my schema could depend on the Kafka header/key, I want to apply the schema at the message level rather than at the dataframe level. How can I achieve this? Thanks.
The code snippet to apply a schema at the dataframe level is:
val ParsedDataFrame = kafkaStreamData.selectExpr("CAST(value AS STRING)", "CAST(key AS STRING)")
.select(from_json(col("value"), Schema).as("value"), col("key"))
.select("value.*", "key")
I want something like,
if(key == 'a'){
use Schema1
}
else{
use Schema2
}
P.S.: I tried using the foreach and map functions but neither worked; maybe I am not using them correctly.
It is not possible to apply a different schema to different rows of the same column, as you would end up getting an AnalysisException due to a data type mismatch.
To test this you can do the following experiment.
Have the following data in the Kafka topic, in the format key:::value:
a:::{"a":"foo","b":"bar"}
b:::{"a":"foo","b":"bar"}
In your streaming query you define:
val schemaA = new StructType().add("a", StringType)
val schemaB = new StructType().add("b", StringType)
val df = spark.readStream
.format("kafka")
.[...]
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.withColumn("parsedJson",
when(col("key") === "a", from_json(col("value"), schemaA))
.otherwise(from_json(col("value"), schemaB)))
This will result in an AnalysisException:
org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`key` = 'a') THEN jsontostructs(`value`) ELSE jsontostructs(`value`) END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;;
'Project [key#21, value#22, CASE WHEN (key#21 = a) THEN jsontostructs(StructField(a,StringType,true), value#22) ELSE jsontostructs(StructField(b,StringType,true), value#22) END AS parsedJson#27]
As #OneCricketeer mentioned in the comments, you need to separate the Kafka input stream into several dataframes based on a filter first, and then apply a different schema to parse the JSON string column in each of them. A sketch is below.
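A minimal sketch of that suggestion. Here rawDf is assumed to be the Kafka stream with key and value already cast to strings (the df above minus the failing withColumn), schemaA/schemaB are reused from the experiment, and the console sinks are placeholders:
import org.apache.spark.sql.functions.{col, from_json}

// Records keyed "a" are parsed with schemaA
val parsedA = rawDf
  .filter(col("key") === "a")
  .select(from_json(col("value"), schemaA).as("parsedJson"))
  .select(col("parsedJson.*"))

// All other records are parsed with schemaB
val parsedB = rawDf
  .filter(col("key") =!= "a")
  .select(from_json(col("value"), schemaB).as("parsedJson"))
  .select(col("parsedJson.*"))

// Each branch runs as its own streaming query
parsedA.writeStream.format("console").outputMode("append").start()
parsedB.writeStream.format("console").outputMode("append").start()
spark.streams.awaitAnyTermination()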

Spark Structured Streaming dataframe/dataset: apply map, iterate records and look up an HBase table using the SHC connector

I am using Structured Streaming with Spark 2.1.1. I need to construct the key from columns available in the streaming dataframe (Kafka source) and get the HBase table data as a static dataframe using the SHC Spark-HBase connector, then apply business logic using both dataframes.
I plan to construct the key by iterating over the records in the streaming dataframe; for each record, after constructing the key, look up the HBase table, get the dataframe using the SHC connector, then apply some business logic using both dataframes and send the response data to a Kafka topic.
Structured Streaming with Spark 2.1.1, Kafka data source, SHC Spark-HBase connector:
val StreamingDF= spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBroker)
.option("subscribePattern", kafkaReqTopic)
.load()
val responseDF = StreamingDF.mapPartitions(rowIter => rowIter.map {
row =>
import spark.sqlContext.implicits._
def catalog = s"""{
|"table":{"namespace":"default", "name":"abchbasetable"},
|"rowkey":"abchbasetablerowkey",
|"columns":{
|"rowkey":{"cf":"rowkey", "col":"abchbasetablerowkey", "type":"string"},
|"col1":{"cf":"topo", "col":"col1", "type":"string"}
|}
|}""".stripMargin
def withCatalog(cat: String): DataFrame = {
spark.sqlContext
.read
.options(Map(HBaseTableCatalog.tableCatalog -> cat))
.format("org.apache.spark.sql.execution.datasources.hbase")
.load()
}
val staticDF = withCatalog(catalog)
staticDF.show(10, false)
})
val kafkaOutput = responseDF.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBroker)
.option("topic", topicname)
.option("checkpointLocation", "/pathtocheckpoint")
.start()
I planned to get the static dataframe from HBase and append its columns to the streaming dataframe.

Writing a streaming dataframe to Kafka

I am reading log lines from a Kafka topic through Spark Structured Streaming, separating the fields of the log lines, performing some manipulations on the fields, and storing them in a dataframe with separate columns for every field. I want to write this dataframe to Kafka.
Below are my sample dataframe and the writeStream for writing it to Kafka:
val dfStructuredWrite = dfProcessedLogs.select(
dfProcessedLogs("result").getItem("_1").as("col1"),
dfProcessedLogs("result").getItem("_2").as("col2"),
dfProcessedLogs("result").getItem("_17").as("col3"))
dfStructuredWrite
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "topic1")
.start()
The above code gives me the below error:
Required attribute 'value' not found
I believe this is because I don't have my dataframe in key/value format. How can I write my existing dataframe to Kafka in the most efficient way?
The dataframe being written to Kafka should have the following columns in its schema:
key (optional) (type: string or binary)
value (required) (type: string or binary)
topic (optional) (type: string)
In your case there is no value column, so the exception is thrown.
You have to modify your dataframe to add at least a value column, e.g.:
import org.apache.spark.sql.functions.{concat, lit}
dfStructuredWrite.select(concat($"col1", lit(" "), $"col2", lit(" "), $"col3").alias("value"))
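Putting it together, a minimal sketch of the full write; the checkpoint path is a placeholder (the Kafka sink requires a checkpoint location for streaming queries):
import org.apache.spark.sql.functions.{concat, lit}

dfStructuredWrite
  .select(concat($"col1", lit(" "), $"col2", lit(" "), $"col3").alias("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .option("checkpointLocation", "/tmp/checkpoints/topic1") // placeholder; required by the Kafka sink
  .start()
  .awaitTermination()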
For more details you can check: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#writing-data-to-kafka