How to parse a JSON string column in PySpark's DataStreamReader and create a DataFrame - pyspark

I am reading messages from a Kafka topic:
messageDFRaw = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test-message") \
    .load()

messageDF = messageDFRaw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING) as dict")
When I print the DataFrame from the above query I get the console output below.
|key|dict|
|#badbunny |{"channel": "#badbunny", "username": "mgat22", "message": "cool"}|
How can I create a DataFrame from the DataStreamReader such that I have a DataFrame with columns |key|channel|username|message|?
I tried following the accepted answer in How to read records in JSON format from Kafka using Structured Streaming?
struct = StructType([
    StructField("channel", StringType()),
    StructField("username", StringType()),
    StructField("message", StringType()),
])

messageDFRaw.select(from_json("CAST(value AS STRING)", struct))
but I get Expected type 'StructField', got 'StructType' instead in from_json().

I ignored the warning Expected type 'StructField', got 'StructType' instead in from_json().
However, I had to cast the value from the Kafka message to a string first and then parse the JSON with the schema afterwards.
messageDF = messageDFRaw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
messageParsedDF = messageDF.select(from_json("value", struct).alias("message"))
messageFlattenedDF = messageParsedDF.selectExpr("message.channel", "message.username", "message.message")

Related

How to Read data from kafka topic with different schema (has some optional objects) in structured streaming

I have data coming into a Kafka topic which contains an optional object, and since it's optional I am missing those records when reading with a defined schema.
For example, the schema I have:
val schema = new StructType()
  .add("obj_1", new ArrayType(
    new StructType(Array(
      StructField("field1", StringType),
      StructField("field2", StringType),
      StructField("obj_2", new ArrayType(
        new StructType(Array(
          StructField("field3", StringType),
          StructField("field4", LongType),
          StructField("obj_3", new ArrayType(
            new StructType(Array(
              StructField("field5", StringType),
              StructField("field6", StringType)
            )),
            containsNull = true))
        )),
        containsNull = true)),
      StructField("field7", StringType),
      StructField("field8", StringType)
    )),
    containsNull = true))
When publishing data to this topic, we sometimes do not send obj_3, based on some conditions.
So when reading the topic and mapping it to the above schema, we miss the records that do not contain obj_3 and only get the data where obj_3 is present.
How can I read the data that sometimes does not contain obj_3?
sample code :
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "bootstrap.servers")
  .option("subscribe", "topic.name")
  .option("startingOffsets", "offset.reset")
  .option("failOnDataLoss", "true")
  .load()

val cast = df.selectExpr("CAST(value AS STRING)")
  .as[String]

val resultedDf = cast.select(from_json(col("value"), schema).as("newDF"))
val finalDf = resultedDf.select(col("newDF.*"))
You could either:
use a flag (e.g. called "obj3flag" within a JSON structure) in the key of the Kafka message that tells your Structured Streaming job whether obj_3 exists in the Kafka value, and then choose one schema or the other to parse the JSON string. Something like:
import org.apache.spark.sql.functions._

val schemaWithObj3 = ...
val schemaWithOutObj3 = ...

val cast = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

val resultedDf = cast
  .withColumn("obj_3_flag", get_json_object(col("key"), "$.obj3flag"))
  .withColumn("data",
    when(col("obj_3_flag") === lit(1), from_json(col("value"), schemaWithObj3))
      .otherwise(from_json(col("value"), schemaWithOutObj3)))
  .select(col("data.*"))
or do a string search on "obj_3" in the Kafka value (cast as a string) and, if the string is found, apply one schema or the other to parse the JSON. The code will look very similar to the one for the first option; a sketch is shown after the note below.
Please note that I have written the code on my mobile, so you may find some syntax issues. However, I hope the idea gets across.
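A minimal sketch of the second (string-search) option, assuming the same df, schemaWithObj3 and schemaWithOutObj3 as above; the substring check is only a heuristic and assumes "obj_3" never appears inside field values:

import org.apache.spark.sql.functions._

// Route each record to the matching schema based on a substring check on the raw JSON
val castedDf = df.selectExpr("CAST(value AS STRING)")

val withObj3Df = castedDf
  .filter(col("value").contains("\"obj_3\""))
  .select(from_json(col("value"), schemaWithObj3).as("data"))
  .select(col("data.*"))

val withoutObj3Df = castedDf
  .filter(!col("value").contains("\"obj_3\""))
  .select(from_json(col("value"), schemaWithOutObj3).as("data"))
  .select(col("data.*"))

// Process the two streams separately, or align their columns and union them if needed.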

writeStream() is printing null values in batches data even i supply proper json data in kafka through writeStream()

I am trying to convert JSON using a schema and print the values to the console, but writeStream() is printing null values in all columns even though I supply proper data.
The data I am sending to the Kafka topic:
{"stock":"SEE","buy":12,"sell":15,"profit":3,quantity:27,"loss":0,"gender":"M"}
{"stock":"SEE","buy":12,"sell":15,"profit":3,quantity:27,"loss":0,"gender":"M"}
{"stock":"SEE","buy":12,"sell":15,"profit":3,quantity:27,"loss":0,"gender":"M"}
Below is my Scala code:
val readStreamDFInd = sparkSession.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "IndiaStocks")
  .option("startingOffsets", "earliest")
  .load()
//readStreamDFInd.printSchema()

val readStreamDFUS = sparkSession.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "USStocks")
  .option("startingOffsets", "earliest")
  .load()

val schema = new StructType()
  .add("stock", StringType)
  .add("buy", IntegerType)
  .add("sell", IntegerType)
  .add("profit", IntegerType)
  .add("quantity", IntegerType)
  .add("loss", IntegerType)
  .add("gender", StringType)

val stocksIndia = readStreamDFInd.selectExpr("CAST(value as STRING) as json")
  .select(from_json($"json", schema).as("data")).select("data.*")
val stocksUSA = readStreamDFUS.selectExpr("CAST(value as STRING) as json")
  .select(from_json($"json", schema).as("data")).select("data.*")

stocksIndia.printSchema()

stocksUSA.writeStream
  .format("console")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()
  .awaitTermination()
The code works fine, as you can also see in the book.
Looking at the documentation of the from_json function, the null values are created because the string is unparseable.
=> You are missing the quotation marks around the quantity key in your JSON string.
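As a quick illustration (a minimal sketch you can paste into spark-shell), from_json yields null for any row whose string cannot be parsed with the given schema:

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructType}
import spark.implicits._

val qtySchema = new StructType().add("quantity", IntegerType)

val sample = Seq(
  """{"quantity":27}""",  // valid JSON: parsed into a struct
  """{quantity:27}"""     // unquoted key: unparseable, so from_json returns null
).toDF("json")

sample.select(from_json($"json", qtySchema).as("data")).show(false)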
The problem is in your Kafka data: the quantity key should be in quotes. Please check below.
{"stock":"SEE","buy":12,"sell":15,"profit":3,"quantity":27,"loss":0,"gender":"M"}
{"stock":"SEE","buy":12,"sell":15,"profit":3,"quantity":27,"loss":0,"gender":"M"}
{"stock":"SEE","buy":12,"sell":15,"profit":3,"quantity":27,"loss":0,"gender":"M"}

How do I convert a dataframe to JSON and write to kafka topic with key

I'm trying to write a DataFrame to Kafka in JSON format and add a key to the DataFrame in Scala. I'm currently working with this sample from the Spark-Kafka integration docs:
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.write
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "topic1")
.save()
Is there a to_json method that can be used (instead of the json(path) option, which I believe writes out to a file in JSON format), and is there a key option that can be used to replace the null value with an actual key?
Here is a minimal example in Scala. Let's say you have a DataFrame df with columns x and y:
val dataDS = df
  .select(
    $"x".cast(StringType),
    $"y".cast(StringType)
  )
  .toJSON
  .withColumn("key", lit("keyname"))

dataDS
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "servername:port")
  .option("topic", "topicname")
  .save()
Remember that you need the spark-sql-kafka library; e.g. for spark-shell it is loaded with:
spark-shell --packages "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1"
You can make use of the to_json SQL function to convert your columns into JSON strings.
See the Scala code below, which also makes use of the Spark SQL built-in function struct (on Spark version 2.4.5). Just make sure that you are naming your columns key and value, or applying corresponding aliases in your selectExpr.
import org.apache.spark.sql.functions.{col, struct, to_json}
import org.apache.spark.sql.SparkSession

object Main extends App {

  val spark = SparkSession.builder()
    .appName("myAppName")
    .master("local[*]")
    .getOrCreate()

  // create DataFrame
  import spark.implicits._
  val df = Seq((3, "Alice"), (5, "Bob")).toDF("age", "name")
  df.show(false)

  // convert columns into json string
  val df2 = df.select(col("name"), to_json(struct($"*"))).toDF("key", "value")
  df2.show(false)
  // +-----+------------------------+
  // |key  |value                   |
  // +-----+------------------------+
  // |Alice|{"age":3,"name":"Alice"}|
  // |Bob  |{"age":5,"name":"Bob"}  |
  // +-----+------------------------+

  // write to Kafka with jsonString as value
  df2.selectExpr("key", "value")
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "test-topic")
    .save()
}
This will return the following data into your Kafka topic:
kafka-console-consumer --bootstrap-server localhost:9092 --property print.key=true --property print.value=true --topic test-topic
Alice {"age":3,"name":"Alice"}
Bob {"age":5,"name":"Bob"}
You can use the toJSON() method on a DataFrame to convert your records to JSON messages.
df = spark.createDataFrame([('user_first_name', 'user_last_nmae', 100)], ['first_name', 'last_name', 'ID'])

import json
from datetime import datetime
from pyspark.sql.functions import lit

json_df = json.loads(df.withColumn('date_as_key', lit(datetime.now().date())).toJSON().first())
print(json_df)
{u'date_as_key': u'2019-07-29', u'first_name': u'user_first_name', u'last_name': u'user_last_nmae', u'ID': 100}
Your code may look like this:
from datetime import datetime
from pyspark.sql.functions import lit

# In PySpark, toJSON() returns an RDD of JSON strings, so wrap it back into a DataFrame
# before adding the key column and writing to Kafka
json_df = df.toJSON().map(lambda value: (value,)).toDF(["value"])

json_df.withColumn("key", lit(str(datetime.now()))) \
    .write \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
    .option("topic", "topic1") \
    .save()
Scala:
import org.apache.spark.sql.functions.lit

someDF.withColumn("key", lit("name")).show() // replace "name" with your variable
someDF.withColumn("key", lit("name")).toJSON.first() // toJSON is available as a method on DataFrames in Scala
res5: String = {"number":8,"word":"bat","key":"name"}

Spark : Kafka consumer getting data as base64 encoded strings even though Producer does not explictly encode

I am trying out a simple example to publish data to Kafka and consume it using Spark.
Here is the Producer code:
var kafka_input = spark.sql("""
  SELECT CAST(Id AS STRING) as key,
         to_json(
           named_struct(
             'Id', Id,
             'Title', Title
           )
         ) as value
  FROM offer_data""")

kafka_input.write
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBrokers)
  .option("topic", topicName)
  .save()
I verified that kafka_input has a JSON string for value and a number cast as a string for key.
Here is the consumer code:
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBrokers)
  .option("subscribe", topicName)
  .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

df.take(50)
display(df)
The data I receive on the consumer side is a base64-encoded string.
How do I decode the value in Scala?
Also, this read statement is not flushing these records from the Kafka queue. I am assuming this is because I am not sending any ack signal back to Kafka. Is that correct? If so, how do I do that?
Try this:
df.foreach(row => {
  val key = row.getAs[Array[Byte]]("key")
  val value = row.getAs[Array[Byte]]("value")
  println(scala.io.Source.fromBytes(key, "UTF-8").mkString)
  println(scala.io.Source.fromBytes(value, "UTF-8").mkString)
})
The problem was with my usage of selectExpr. It does not do an in-place transformation; it returns the transformed data.
Fix:
val df1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

display(df1)

How to use from_json with Kafka connect 0.10 and Spark Structured Streaming?

I was trying to reproduce the example from Databricks and apply it to the new connector to Kafka and Spark Structured Streaming; however, I cannot parse the JSON correctly using the out-of-the-box methods in Spark...
note: the topic is written into Kafka in JSON format.
val ds1 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", IP + ":9092")
  .option("zookeeper.connect", IP + ":2181")
  .option("subscribe", TOPIC)
  .option("startingOffsets", "earliest")
  .option("max.poll.records", 10)
  .option("failOnDataLoss", false)
  .load()
The following code won't work; I believe that's because the column json is a string and does not match the from_json method signature...
val df = ds1.select($"value" cast "string" as "json")
  .select(from_json("json") as "data")
  .select("data.*")
Any tips?
[UPDATE] Example working:
https://github.com/katsou55/kafka-spark-structured-streaming-example/blob/master/src/main/scala-2.11/Main.scala
First you need to define the schema for your JSON message. For example:
val schema = new StructType()
  .add($"id".string)
  .add($"name".string)
Now you can use this schema in the from_json method like below.
val df = ds1.select($"value" cast "string" as "json")
  .select(from_json($"json", schema) as "data")
  .select("data.*")