Spark: Kafka consumer getting data as base64-encoded strings even though the producer does not explicitly encode - Scala

I am trying out a simple example to publish data to Kafka and consume it using Spark.
Here is the Producer code:
var kafka_input = spark.sql("""
SELECT CAST(Id AS STRING) as key,
to_json(
named_struct(
'Id', Id,
'Title',Title
)
) as value
FROM offer_data""")
kafka_input.write
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBrokers)
.option("topic", topicName)
.save()
I verified that kafka_input has a JSON string for value and a number cast to string for key.
Here is the consumer code:
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBrokers)
.option("subscribe", topicName)
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
df.take(50)
display(df)
The data I receive on the consumer side is a base64-encoded string.
How do I decode the value in Scala?
Also, this read statement is not flushing these records from the Kafka queue. I am assuming this is because I am not sending any ack signal back to Kafka. Is that correct? If so, how do I do that?

Try this:
df.foreach(row => {
val key = row.getAs[Array[Byte]]("key")
val value = row.getAs[Array[Byte]]("value")
println(scala.io.Source.fromBytes(key,"UTF-8").mkString)
println(scala.io.Source.fromBytes(value,"UTF-8").mkString)
})
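Equivalently, a plain new String works for decoding the raw bytes; a minimal alternative sketch:
import java.nio.charset.StandardCharsets

df.foreach(row => {
  val key = row.getAs[Array[Byte]]("key")
  val value = row.getAs[Array[Byte]]("value")
  // Decode the raw Kafka key/value bytes as UTF-8
  println(new String(key, StandardCharsets.UTF_8))
  println(new String(value, StandardCharsets.UTF_8))
})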

The problem was with my usage of selectExpr. It does not do an in-place transformation; it returns the transformed data.
Fix:
val df1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
display(df1)
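As a follow-up, the JSON in value can be parsed back into columns with from_json; a sketch, assuming the Id/Title fields written by the producer (LongType for Id is an assumption):
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._

// Schema matching the producer's named_struct('Id', Id, 'Title', Title)
val valueSchema = new StructType()
  .add("Id", LongType)
  .add("Title", StringType)

val parsed = df1
  .select($"key", from_json($"value", valueSchema).as("data"))
  .select($"key", $"data.Id", $"data.Title")

display(parsed)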

Related

KafkaUtils.createDirectStream gives an error

I changed a line from createStream to createDirectStream since the new library does not support createStream.
I checked it against this example: https://codewithgowtham.blogspot.com/2022/02/spark-streaming-kafka-cassandra-end-to.html
scala> val lines = KafkaUtils.createDirectStream(ssc, "localhost:2181", "sparkgroup", topicpMap).map(_._2)
<console>:44: error: overloaded method value createDirectStream with alternatives:
[K, V](jssc: org.apache.spark.streaming.api.java.JavaStreamingContext, locationStrategy: org.apache.spark.streaming.kafka010.LocationStrategy, consumerStrategy: org.apache.spark.streaming.kafka010.ConsumerStrategy[K,V], perPartitionConfig: org.apache.spark.streaming.kafka010.PerPartitionConfig)org.apache.spark.streaming.api.java.JavaInputDStream[org.apache.kafka.clients.consumer.ConsumerRecord[K,V]] <and>
[K, V](ssc: org.apache.spark.streaming.StreamingContext, locationStrategy: org.apache.spark.streaming.kafka010.LocationStrategy, consumerStrategy: org.apache.spark.streaming.kafka010.ConsumerStrategy[K,V], perPartitionConfig: org.apache.spark.streaming.kafka010.PerPartitionConfig)org.apache.spark.streaming.dstream.InputDStream[org.apache.kafka.clients.consumer.ConsumerRecord[K,V]]
cannot be applied to (org.apache.spark.streaming.StreamingContext, String, String, scala.collection.immutable.Map[String,Int])
val lines = KafkaUtils.createDirectStream(ssc, "localhost:2181", "sparkgroup", topicpMap).map(_._2)
It's already 2022; there should be a very specific reason for using legacy Spark Streaming. Instead, you should use Spark Structured Streaming, which is much easier to use than the legacy API. With it, working with Kafka is very simple:
// create the DataFrame (use readStream instead of read for a streaming query)
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.load()
// Decode payload - it heavily depends on the data format in the Kafka
val decoded = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
You can use the same APIs for working with both streaming & batch data.
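For the streaming variant, a minimal sketch (the sink and checkpoint path are assumptions):
// Same Kafka options, but readStream/writeStream instead of read/write
val stream = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

stream.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/topic1-console")  // assumed path
  .outputMode("append")
  .start()
  .awaitTermination()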

Can we fetch data from Kafka from a specific offset in Spark Structured Streaming batch mode?

In Kafka I get new topics dynamically and I have to process them using Spark Streaming from a specific offset. Is there a possibility to pass the JSON value from a variable? For example, consider the code below:
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "topic.*")
.option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
.load()
Here I want to dynamically update the value for startingOffsets. I tried to pass the value as a string variable, but it did not work, even though giving the same value directly in startingOffsets works. How do I use a variable in this scenario?
val start_offset= """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}"""
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "topic.*")
.option("startingOffsets", start_offset)
.load()
java.lang.IllegalArgumentException: Expected e.g. {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got """{"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}"""
def main(args: Array[String]) {
val conf = new SparkConf().setMaster("local[*]").setAppName("ReadSpecificOffsetFromKafka");
val spark = SparkSession.builder().config(conf).getOrCreate();
spark.sparkContext.setLogLevel("error");
import spark.implicits._;
val start_offset = """{"first_topic" : {"0" : 15, "1": -2, "2": 6}}"""
val fromKafka = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092, localhost:9093")
.option("subscribe", "first_topic")
// .option("startingOffsets", "earliest")
.option("startingOffsets", start_offset)
.load();
val selectedValues = fromKafka.selectExpr("cast(value as string)", "cast(partition as integer)");
selectedValues.writeStream
.format("console")
.outputMode("append")
// .trigger(Trigger.Continuous("3 seconds"))
.start()
.awaitTermination();
}
This is the exact code to fetch a specific offset from Kafka using Spark Structured Streaming and Scala.
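If the topics and offsets are only known at runtime, the startingOffsets JSON can also be assembled from a Map instead of being hard-coded; a sketch, with topic names and offsets as placeholder assumptions:
// Build the startingOffsets JSON dynamically (-2 means "earliest" for that partition)
val offsets: Map[String, Map[Int, Long]] =
  Map("first_topic" -> Map(0 -> 15L, 1 -> -2L, 2 -> 6L))

val startOffsetJson = offsets.map { case (topic, partitions) =>
  val parts = partitions.map { case (p, o) => s""""$p": $o""" }.mkString(", ")
  s""""$topic": {$parts}"""
}.mkString("{", ", ", "}")
// startOffsetJson == {"first_topic": {"0": 15, "1": -2, "2": 6}}

// then: .option("startingOffsets", startOffsetJson)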
Looks like your job is checkpointing the Kafka offsets onto some persistent storage. Try cleaning those and re-running your job. Also try renaming your job and running it again.
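For reference, the offsets are stored under the query's checkpointLocation; a sketch of pointing the sink at a fresh path (the path itself is an assumption):
// Once a checkpoint exists, startingOffsets is ignored on restart, so use a new
// checkpoint directory (or delete the old one) to make startingOffsets take effect.
selectedValues.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/read-specific-offset")  // assumed path
  .outputMode("append")
  .start()
  .awaitTermination()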
Spark can read the stream via readStream, so try with an offset in the format displayed in the error message to get rid of the error:
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "topic.*")
.option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
.load()

writeStream() is printing null values in batch data even though I supply proper JSON data to Kafka

I am trying to convert JSON using a schema and print the values to the console, but writeStream() is printing null values in all columns even though I gave proper data.
The data I am sending to the Kafka topic:
{"stock":"SEE","buy":12,"sell":15,"profit":3,quantity:27,"loss":0,"gender":"M"}
{"stock":"SEE","buy":12,"sell":15,"profit":3,quantity:27,"loss":0,"gender":"M"}
{"stock":"SEE","buy":12,"sell":15,"profit":3,quantity:27,"loss":0,"gender":"M"}
Below is my Scala code:
val readStreamDFInd = sparkSession.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "IndiaStocks")
.option("startingOffsets", "earliest")
.load()
//readStreamDFInd.printSchema()
val readStreamDFUS = sparkSession.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "USStocks")
.option("startingOffsets", "earliest")
.load()
val schema = new StructType()
.add("stock", StringType)
.add("buy", IntegerType)
.add("sell", IntegerType)
.add("profit", IntegerType)
.add("quantity", IntegerType)
.add("loss", IntegerType)
.add("gender", StringType)
val stocksIndia = readStreamDFInd.selectExpr("CAST(value as STRING) as json").select(from_json($"json", schema).as("data")).select("data.*")
val stocksUSA = readStreamDFUS.selectExpr("CAST(value as STRING) as json").select(from_json($"json", schema).as("data")).select("data.*")
stocksIndia.printSchema()
stocksUSA.writeStream
.format("console")
.outputMode("append").trigger(Trigger.ProcessingTime("5 seconds"))
.start()
.awaitTermination()
The code works fine, as you can also see in the book.
Looking at the documentation of the from_json function, the null values are created because the string is unparseable: you are missing the quotation marks around the quantity field in your JSON string.
The problem is in your Kafka data: the quantity key should be in quotes. Please check below.
{"stock":"SEE","buy":12,"sell":15,"profit":3,"quantity":27,"loss":0,"gender":"M"}
{"stock":"SEE","buy":12,"sell":15,"profit":3,"quantity":27,"loss":0,"gender":"M"}
{"stock":"SEE","buy":12,"sell":15,"profit":3,"quantity":27,"loss":0,"gender":"M"}

How to parse a JSON string column in PySpark's DataStreamReader and create a DataFrame

I am reading messages from a Kafka topic:
messageDFRaw = spark.readStream\
.format("kafka")\
.option("kafka.bootstrap.servers", "localhost:9092")\
.option("subscribe", "test-message")\
.load()
messageDF = messageDFRaw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING) as dict")
When I print the data frame from the above query I get the below console output.
|key|dict|
|#badbunny |{"channel": "#badbunny", "username": "mgat22", "message": "cool"}|
How can I create a DataFrame from the DataStreamReader such that I have columns |key|channel|username|message|?
I tried following the accepted answer in How to read records in JSON format from Kafka using Structured Streaming?
struct = StructType([
StructField("channel", StringType()),
StructField("username", StringType()),
StructField("message", StringType()),
])
messageDFRaw.select(from_json("CAST(value AS STRING)", struct))
but I get Expected type 'StructField', got 'StructType' instead in from_json().
I ignored the warning Expected type 'StructField', got 'StructType' instead in from_json().
However, I had to cast the value from the Kafka message first and then parse it with the JSON schema afterwards.
messageDF = messageDFRaw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
messageParsedDF = messageDF.select(from_json("value", struct_schema).alias("message"))
messageFlattenedDF = messageParsedDF.selectExpr("message.channel", "message.username", "message.message")

How to use from_json with Kafka connect 0.10 and Spark Structured Streaming?

I was trying to reproduce the example from Databricks and apply it to the new Kafka connector and Spark Structured Streaming; however, I cannot parse the JSON correctly using the out-of-the-box methods in Spark.
Note: the topic is written to Kafka in JSON format.
val ds1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", IP + ":9092")
.option("zookeeper.connect", IP + ":2181")
.option("subscribe", TOPIC)
.option("startingOffsets", "earliest")
.option("max.poll.records", 10)
.option("failOnDataLoss", false)
.load()
The following code won't work; I believe that's because the column json is a string and does not match the from_json method signature:
val df = ds1.select($"value" cast "string" as "json")
.select(from_json("json") as "data")
.select("data.*")
Any tips?
[UPDATE] Working example:
https://github.com/katsou55/kafka-spark-structured-streaming-example/blob/master/src/main/scala-2.11/Main.scala
First you need to define the schema for your JSON message. For example:
val schema = new StructType()
.add($"id".string)
.add($"name".string)
Now you can use this schema in the from_json method as below:
val df = ds1.select($"value" cast "string" as "json")
.select(from_json($"json", schema) as "data")
.select("data.*")