writeStream() prints null values in batch data even though I supply proper JSON data to Kafka - Scala

I am trying to parse JSON with a schema and print the values to the console, but writeStream() prints null values in all columns even though I supply proper data.
This is the data I am sending to the Kafka topic:
{"stock":"SEE","buy":12,"sell":15,"profit":3,quantity:27,"loss":0,"gender":"M"}
{"stock":"SEE","buy":12,"sell":15,"profit":3,quantity:27,"loss":0,"gender":"M"}
{"stock":"SEE","buy":12,"sell":15,"profit":3,quantity:27,"loss":0,"gender":"M"}
Below is my Scala code:
val readStreamDFInd = sparkSession.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "IndiaStocks")
.option("startingOffsets", "earliest")
.load()
//readStreamDFInd.printSchema()
val readStreamDFUS = sparkSession.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "USStocks")
.option("startingOffsets", "earliest")
.load()
val schema = new StructType()
.add("stock", StringType)
.add("buy", IntegerType)
.add("sell", IntegerType)
.add("profit", IntegerType)
.add("quantity", IntegerType)
.add("loss", IntegerType)
.add("gender", StringType)
val stocksIndia = readStreamDFInd.selectExpr("CAST(value as STRING) as json").select(from_json($"json", schema).as("data")).select("data.*")
val stocksUSA = readStreamDFUS.selectExpr("CAST(value as STRING) as json").select(from_json($"json", schema).as("data")).select("data.*")
stocksIndia.printSchema()

stocksUSA.writeStream
.format("console")
.outputMode("append")
.trigger(Trigger.ProcessingTime("5 seconds"))
.start()
.awaitTermination()

The code itself works fine, as you can also see in the book.
Looking at the documentation of the from_json function, the null values are produced because the string is unparseable.
=> You are missing the quotation marks around the quantity field in your JSON string.
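If you want to see which records fail to parse instead of silently getting nulls, a minimal sketch (reusing readStreamDFInd and schema from the question) is to keep the raw payload next to the parsed struct and filter on a null result:
import org.apache.spark.sql.functions.{col, from_json}

val parsed = readStreamDFInd
  .selectExpr("CAST(value AS STRING) AS json")
  .withColumn("data", from_json(col("json"), schema))

// Rows whose JSON could not be parsed keep their raw string but get a null struct.
val badRecords = parsed.filter(col("data").isNull).select("json")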

The problem is in your Kafka data: the quantity field name should be in quotes. Please check below.
{"stock":"SEE","buy":12,"sell":15,"profit":3,"quantity":27,"loss":0,"gender":"M"}
{"stock":"SEE","buy":12,"sell":15,"profit":3,"quantity":27,"loss":0,"gender":"M"}
{"stock":"SEE","buy":12,"sell":15,"profit":3,"quantity":27,"loss":0,"gender":"M"}

Related

How to read data from a Kafka topic with a varying schema (some optional objects) in Structured Streaming

I have data coming into a Kafka topic that contains an optional object, and since it is optional I am missing those records when reading with a defined schema.
For example, the schema I have:
val schema = new StructType()
  .add("obj_1", new ArrayType(
    new StructType(Array(
      StructField("field1", StringType),
      StructField("field2", StringType),
      StructField("obj_2", new ArrayType(
        new StructType(Array(
          StructField("field3", StringType),
          StructField("field4", LongType),
          StructField("obj_3", new ArrayType(
            new StructType(Array(
              StructField("field5", StringType),
              StructField("field6", StringType)
            )),
            containsNull = true
          ))
        )),
        containsNull = true
      )),
      StructField("field7", StringType),
      StructField("field8", StringType)
    )),
    containsNull = true
  ))
When publishing data to this topic we sometimes do not send obj_3, depending on certain conditions.
So when reading the topic and mapping it to the above schema, we lose the records that do not contain obj_3 and only keep the ones where obj_3 is present.
How can I read the data when obj_3 is sometimes missing?
Sample code:
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "bootstrap.servers")
  .option("subscribe", "topic.name")
  .option("startingOffsets", "offset.reset")
  .option("failOnDataLoss", "true")
  .load()

val cast = df.selectExpr("CAST(value AS STRING)")
  .as[String]

val resultedDf = cast.select(from_json(col("value"), schema).as("newDF"))
val finalDf = resultedDf.select(col("newDF.*"))
You could either:
use a flag (e.g. called "obj3flag") in the key of the Kafka message that tells your Structured Streaming job whether obj_3 is present in the Kafka value, and then choose one schema or the other to parse the JSON string. Something like:
import org.apache.spark.sql.functions._

val schemaWithObj3 = ...    // schema that includes obj_3
val schemaWithOutObj3 = ... // schema without obj_3

val cast = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

val resultedDf = cast
  .withColumn("obj_3_flag", get_json_object(col("key"), "$.obj3flag"))
  .withColumn("data",
    when(col("obj_3_flag") === lit(1), from_json(col("value"), schemaWithObj3))
      .otherwise(from_json(col("value"), schemaWithOutObj3)))
  .select(col("data.*"))
do a string search on "obj_3" in the Kafka value (cast as a string) and, if the string is found, apply one schema or the other to parse the JSON (see the sketch below). The code will look very similar to the one for the first option.
Please note that I wrote the code on my mobile, so you may find some syntax issues. However, I hope the idea gets across.
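A minimal sketch of the second option, assuming the same cast Dataset and the two hypothetical schemas from above; parsing the two cases into separate DataFrames avoids mixing two different struct types in a single column:
import org.apache.spark.sql.functions.{col, from_json}

val withObj3 = cast
  .filter(col("value").contains("obj_3"))
  .select(from_json(col("value"), schemaWithObj3).as("data"))
  .select("data.*")

val withoutObj3 = cast
  .filter(!col("value").contains("obj_3"))
  .select(from_json(col("value"), schemaWithOutObj3).as("data"))
  .select("data.*")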

Can we fetch data from Kafka from a specific offset in Spark Structured Streaming batch mode?

In Kafka I get new topics dynamically and I have to process them using Spark streaming from a specific offset. Is there a possibility to pass the JSON value from a variable? For example, consider the code below:
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "topic.*")
.option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
.load()
Here I want to update the value of startingOffsets dynamically. I tried to pass the value as a string variable, but it did not work, although hard-coding the same value in startingOffsets works. How do I use a variable in this scenario?
val start_offset= """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}"""
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "topic.*")
.option("startingOffsets", start_offset)
.load()
java.lang.IllegalArgumentException: Expected e.g. {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got """{"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}"""
def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setMaster("local[*]").setAppName("ReadSpecificOffsetFromKafka")
  val spark = SparkSession.builder().config(conf).getOrCreate()
  spark.sparkContext.setLogLevel("error")
  import spark.implicits._

  val start_offset = """{"first_topic" : {"0" : 15, "1": -2, "2": 6}}"""

  val fromKafka = spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092, localhost:9093")
    .option("subscribe", "first_topic")
    // .option("startingOffsets", "earliest")
    .option("startingOffsets", start_offset)
    .load()

  val selectedValues = fromKafka.selectExpr("cast(value as string)", "cast(partition as integer)")

  selectedValues.writeStream
    .format("console")
    .outputMode("append")
    // .trigger(Trigger.Continuous("3 seconds"))
    .start()
    .awaitTermination()
}
This is the exact code to fetch a specific offset from Kafka using Spark Structured Streaming and Scala.
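If the offsets themselves need to change at runtime, one way (a sketch; the topic and offset values are simply the ones from the example above) is to build the startingOffsets JSON string from a Map before passing it to the option:
// Offsets per topic and partition; -2 means "earliest" for that partition.
val offsets: Map[String, Map[String, Long]] =
  Map("first_topic" -> Map("0" -> 15L, "1" -> -2L, "2" -> 6L))

val start_offset = offsets.map { case (topic, parts) =>
  val partitionJson = parts.map { case (p, o) => s""""$p": $o""" }.mkString(", ")
  s""""$topic": {$partitionJson}"""
}.mkString("{", ", ", "}")
// e.g. {"first_topic": {"0": 15, "1": -2, "2": 6}}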
It looks like your job is checkpointing the Kafka offsets to some persistent storage. Try cleaning those checkpoints and re-running the job. Also try renaming the job and running it again.
Spark can read the stream via readStream, so try it with offsets in the format shown in the error message to get rid of the error:
spark
  .readStream
  .format("kafka")
  .option("subscribePattern", "topic.*")
  .option("startingOffsets", """{"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}""")
  .load()

How to send parquet to Kafka in batches using Spark Structured Streaming?

I am reading parquet files, converting them into JSON format, and sending them to Kafka. The problem is that the whole parquet file is read and sent to Kafka in one go, but I want to send the JSON data line by line or in batches:
object WriteParquet2Kafka {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder
      .master("yarn")
      .appName("Write Parquet to Kafka")
      .getOrCreate()

    import spark.implicits._

    val ds: DataFrame = spark.readStream
      .schema(parquet-schema)
      .parquet(path-to-parquet-file)

    val df: DataFrame = ds.select($"vin" as "key", to_json(struct(ds.columns.map(col(_)): _*)) as "value")
      .filter($"key" isNotNull)

    val ddf = df
      .writeStream
      .format("kafka")
      .option("topic", topics)
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("checkpointLocation", "/tmp/test")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()

    ddf.awaitTermination()
  }
}
Is it possible to do this?
I finally figured out how to solve my problem: just add an option and set a suitable number for maxFilesPerTrigger:
val df: DataFrame = spark
.readStream
.option("maxFilesPerTrigger", 1)
.schema(parquetSchema)
.parquet(parquetUri)
Note: maxFilesPerTrigger must be set to 1 so that every parquet file is read one at a time.
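Putting the two pieces together, a sketch of the full pipeline (parquetSchema, parquetUri and topics are placeholders taken from the snippets above) that reads one parquet file per trigger and writes it to Kafka:
val stream: DataFrame = spark
  .readStream
  .option("maxFilesPerTrigger", 1)
  .schema(parquetSchema)
  .parquet(parquetUri)

stream
  .select($"vin" as "key", to_json(struct(stream.columns.map(col(_)): _*)) as "value")
  .filter($"key" isNotNull)
  .writeStream
  .format("kafka")
  .option("topic", topics)
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("checkpointLocation", "/tmp/test")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()
  .awaitTermination()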

Spark: Kafka consumer getting data as base64-encoded strings even though the producer does not explicitly encode

I am trying out a simple example to publish data to Kafka and consume it using Spark.
Here is the Producer code:
var kafka_input = spark.sql("""
SELECT CAST(Id AS STRING) as key,
to_json(
named_struct(
'Id', Id,
'Title',Title
)
) as value
FROM offer_data""")
kafka_input.write
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBrokers)
.option("topic", topicName)
.save()
I verified that kafka_input has a JSON string for value and a number cast as a string for key.
Here is the consumer code:
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBrokers)
.option("subscribe", topicName)
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
df.take(50)
display(df)
The data I receive on the consumer side is a base64-encoded string.
How do I decode the value in Scala?
Also, this read statement is not flushing these records from the Kafka queue. I am assuming this is because I am not sending any ack signal back to Kafka. Is that correct? If so, how do I do that?
Try this:
df.foreach(row => {
  val key = row.getAs[Array[Byte]]("key")
  val value = row.getAs[Array[Byte]]("value")
  println(scala.io.Source.fromBytes(key, "UTF-8").mkString)
  println(scala.io.Source.fromBytes(value, "UTF-8").mkString)
})
The problem was with my usage of selectExpr. It does not do an in-place transformation; it returns the transformed data.
Fix:
val df1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
display(df1)

How to use from_json with Kafka connect 0.10 and Spark Structured Streaming?

I was trying to reproduce the example from Databricks and apply it to the new Kafka connector and Spark Structured Streaming; however, I cannot parse the JSON correctly using the out-of-the-box methods in Spark.
note: the topic is written into Kafka in JSON format.
val ds1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", IP + ":9092")
.option("zookeeper.connect", IP + ":2181")
.option("subscribe", TOPIC)
.option("startingOffsets", "earliest")
.option("max.poll.records", 10)
.option("failOnDataLoss", false)
.load()
The following code won't work; I believe that's because the column json is a string and does not match the from_json method signature:
val df = ds1.select($"value" cast "string" as "json")
.select(from_json("json") as "data")
.select("data.*")
Any tips?
[UPDATE] Working example:
https://github.com/katsou55/kafka-spark-structured-streaming-example/blob/master/src/main/scala-2.11/Main.scala
First you need to define the schema for your JSON message. For example:
val schema = new StructType()
.add($"id".string)
.add($"name".string)
Now you can use this schema in the from_json method as shown below:
val df = ds1.select($"value" cast "string" as "json")
.select(from_json($"json", schema) as "data")
.select("data.*")