Messages not loading into silver table from Kafka topic - Scala

I am trying to load messages from a Kafka topic into a silver table inside writeStream, but the messages never show up in the silver table. How can I get the messages read into the silver table?
var df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "10.19.9.4:1111")
  .option("subscribe", "testTopic")
  .load()

// select the avro-encoded value and the topic name from the topic
df = df.select($"value", $"topic")

df.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.persist()
    val topics = batchDF.select("topic").distinct().collect().map(row => row.getString(0))
    topics.foreach { topix =>
      val silverTable = mappings(topix)
      // filter out messages for the current topic
      var writeDF = batchDF.where(s"topic = '${topix}'")
      // decode the avro records to a spark struct
      val schemaReg = schemaRegistryMappings(topix)
      writeDF = writeDF.withColumn("avroconverted",
        from_avro($"value", topix + "-value", schemaReg))
      // append to the silver table
      writeDF.write.format("delta").mode("append").saveAsTable("silverTable")
    }
  }
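For reference, a minimal sketch of what the full write path could look like, assuming the mappings and schemaRegistryMappings lookups from the question; the checkpoint path is an assumed placeholder, the target table name comes from the mapping rather than the literal string "silverTable", and the streaming query must be started with start() or nothing is ever written:
df.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.persist()
    val topics = batchDF.select("topic").distinct().collect().map(_.getString(0))
    topics.foreach { topix =>
      val silverTable = mappings(topix)                 // target table for this topic
      val schemaReg   = schemaRegistryMappings(topix)   // schema registry for this topic
      val writeDF = batchDF
        .where(s"topic = '$topix'")
        .withColumn("avroconverted", from_avro($"value", topix + "-value", schemaReg))
      writeDF.write.format("delta").mode("append").saveAsTable(silverTable)
    }
    batchDF.unpersist()
    ()
  }
  .option("checkpointLocation", "/tmp/silver_checkpoint")  // assumed path
  .start()                                                 // without start() the query never runs
  .awaitTermination()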

Related

Kafka Unit testing

I have a function kafkaIngestion which creates a df from a Kafka topic in the following way:
def kafkaIngestion(spark: SparkSession): DataFrame = {
  val df = spark.read.format("kafka")
    .option("kafka.bootstrap.servers", broker)
    .option("subscribe", topic)
    .option("group.id", grpid)
    .load()
    .selectExpr("cast(value as string) as data")
    .select(from_json($"data", schema = inputSchema).as("data"))
    .select("data.*")
  df
}
I am unable to mock the code to return my expected df. What's the correct way to mock the df?
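One approach (a sketch, not the only way): instead of mocking the Kafka source itself, split the parsing out of kafkaIngestion and unit-test it against a DataFrame built from a local sequence. The helper parseMessages, the sampleSchema and the sample JSON payload below are hypothetical:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

// Pure transformation with no Kafka dependency, so it can be tested without mocking.
def parseMessages(raw: DataFrame, inputSchema: StructType)(implicit spark: SparkSession): DataFrame = {
  import spark.implicits._
  raw.selectExpr("cast(value as string) as data")
    .select(from_json($"data", inputSchema).as("data"))
    .select("data.*")
}

// In the test, build a DataFrame shaped like the Kafka value column and assert on the result.
implicit val spark: SparkSession =
  SparkSession.builder().master("local[*]").appName("kafkaIngestionTest").getOrCreate()
import spark.implicits._

val sampleSchema = new StructType().add("id", IntegerType).add("name", StringType)  // hypothetical schema
val fakeKafka = Seq("""{"id": 1, "name": "a"}""").toDF("value")                      // hypothetical payload
val parsed = parseMessages(fakeKafka, sampleSchema)
assert(parsed.count() == 1 && parsed.columns.sameElements(Array("id", "name")))
The production kafkaIngestion then only does the Kafka read and delegates to the same transformation.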

Can we fetch data from Kafka from a specific offset in Spark Structured Streaming batch mode?

In Kafka I get new topics dynamically, and I have to process them with Spark Streaming from a specific offset. Is it possible to pass the JSON value from a variable? For example, consider the code below:
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
  .load()
Here I want to update the value of startingOffsets dynamically. I tried passing the value as a string variable, but it did not work, even though hard-coding the same value in startingOffsets works. How can I use a variable in this scenario?
val start_offset = """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}"""
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .option("startingOffsets", start_offset)
  .load()
This fails with:
java.lang.IllegalArgumentException: Expected e.g. {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got """{"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}"""
This is the exact code to fetch a specific offset from Kafka using Spark Structured Streaming and Scala:
def main(args: Array[String]) {
  val conf = new SparkConf().setMaster("local[*]").setAppName("ReadSpecificOffsetFromKafka")
  val spark = SparkSession.builder().config(conf).getOrCreate()
  spark.sparkContext.setLogLevel("error")
  import spark.implicits._

  val start_offset = """{"first_topic" : {"0" : 15, "1": -2, "2": 6}}"""
  val fromKafka = spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092, localhost:9093")
    .option("subscribe", "first_topic")
    // .option("startingOffsets", "earliest")
    .option("startingOffsets", start_offset)
    .load()

  val selectedValues = fromKafka.selectExpr("cast(value as string)", "cast(partition as integer)")
  selectedValues.writeStream
    .format("console")
    .outputMode("append")
    // .trigger(Trigger.Continuous("3 seconds"))
    .start()
    .awaitTermination()
}
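Since the topics arrive dynamically, the startingOffsets JSON itself can be built from a variable at runtime rather than hard-coded. A small sketch, with a hypothetical helper that renders a Map of topic -> (partition -> offset) into the format the Kafka source expects:
// Hypothetical helper: render {"topic":{"partition":offset, ...}, ...}
def offsetsJson(offsets: Map[String, Map[Int, Long]]): String =
  offsets
    .map { case (topic, parts) =>
      val partitions = parts.map { case (p, o) => s""""$p": $o""" }.mkString(", ")
      s""""$topic": {$partitions}"""
    }
    .mkString("{", ", ", "}")

val startOffsets = offsetsJson(Map("first_topic" -> Map(0 -> 15L, 1 -> -2L, 2 -> 6L)))
// startOffsets == {"first_topic": {"0": 15, "1": -2, "2": 6}}

val fromKafka = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092,localhost:9093")
  .option("subscribe", "first_topic")
  .option("startingOffsets", startOffsets)
  .load()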
It looks like your job is checkpointing the Kafka offsets to some persistent storage. Try cleaning those checkpoints and re-running your job.
Also try renaming your job and running it again.
Spark can read the stream via readStream, so try it with an offset in the format shown in the error message to get rid of the error:
spark
  .readStream
  .format("kafka")
  .option("subscribePattern", "topic.*")

Spark Streaming: join Kafka topics

We have two InputDStreams from two Kafka topics, and we have to join the data of these two inputs together.
The problem is that each InputDStream is processed independently; because of foreachRDD, nothing can be returned to join afterwards.
var Message1ListBuffer = new ListBuffer[Message1]
var Message2ListBuffer = new ListBuffer[Message2]

inputDStream1.foreachRDD(rdd => {
  if (!rdd.partitions.isEmpty) {
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd.map({ msg =>
      val r = msg.value()
      val avro = AvroUtils.objectToAvro(r.getSchema, r)
      val messageValue = AvroInputStream.json[FMessage1](avro.getBytes("UTF-8")).singleEntity.get
      Message1ListBuffer = Message1FlatMapper.flatmap(messageValue)
      Message1ListBuffer
    })
    inputDStream1.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }
})

inputDStream2.foreachRDD(rdd => {
  if (!rdd.partitions.isEmpty) {
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd.map({ msg =>
      val r = msg.value()
      val avro = AvroUtils.objectToAvro(r.getSchema, r)
      val messageValue = AvroInputStream.json[FMessage2](avro.getBytes("UTF-8")).singleEntity.get
      Message2ListBuffer = Message1FlatMapper.flatmap(messageValue)
      Message2ListBuffer
    })
    inputDStream2.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }
})
I thought I could return Message1ListBuffer and Message2ListBuffer, turn them into DataFrames and join them, but that does not work, and I do not think it is the best choice.
So what is the way to return the RDD from each foreachRDD in order to do a join?
inputDStream1.foreachRDD(rdd => {
})
inputDStream2.foreachRDD(rdd => {
})
Not sure which Spark version you are using, but with Spark >= 2.3 this can be achieved directly with Structured Streaming.
Subscribe to the two topics you want to join:
val ds1 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "brokerhost1:port1,brokerhost2:port2")
  .option("subscribe", "source-topic1")
  .option("startingOffsets", "earliest")
  .load()

val ds2 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "brokerhost1:port1,brokerhost2:port2")
  .option("subscribe", "source-topic2")
  .option("startingOffsets", "earliest")
  .load()
Format the subscribed messages in both streams
val stream1 = ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
val stream2 = ds2.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
Join both the streams:
val resultStream = stream1.join(stream2)
// add the join condition and any further join operations here
Warning:
Late records will not get a join match; you need to tweak the buffering (watermarks) a bit. More information can be found in the Structured Streaming documentation on stream-stream joins.
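For completeness, a sketch of how the late-record issue can be bounded with watermarks and a time-range join condition; the column renames and the 10/15-minute windows are illustrative assumptions, not from the original answer:
import org.apache.spark.sql.functions.expr

// Rename columns to avoid ambiguity, keep the Kafka event timestamp, and bound state with watermarks.
val left = ds1
  .selectExpr("CAST(key AS STRING) AS leftKey", "CAST(value AS STRING) AS leftValue", "timestamp AS leftTime")
  .withWatermark("leftTime", "10 minutes")

val right = ds2
  .selectExpr("CAST(key AS STRING) AS rightKey", "CAST(value AS STRING) AS rightValue", "timestamp AS rightTime")
  .withWatermark("rightTime", "10 minutes")

// Inner join on the message key, constrained to a time range so Spark can drop old buffered rows.
val joined = left.join(
  right,
  expr("""
    leftKey = rightKey AND
    rightTime >= leftTime AND
    rightTime <= leftTime + interval 15 minutes
  """))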

Spark: Kafka consumer getting data as base64-encoded strings even though the producer does not explicitly encode

I am trying out a simple example to publish data to Kafka and consume it using Spark.
Here is the Producer code:
var kafka_input = spark.sql("""
  SELECT CAST(Id AS STRING) as key,
         to_json(
           named_struct(
             'Id', Id,
             'Title', Title
           )
         ) as value
  FROM offer_data""")

kafka_input.write
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBrokers)
  .option("topic", topicName)
  .save()
I verified that kafka_input has a JSON string for value and a number cast to string for key.
Here is the consumer code:
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBrokers)
  .option("subscribe", topicName)
  .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

df.take(50)
display(df)
The data I receive on the consumer side is a base64-encoded string.
How do I decode the value in Scala?
Also, this read statement is not flushing these records from the Kafka queue. I am assuming this is because I am not sending any ack signal back to Kafka. Is that correct? If so, how do I do that?
Try this:
df.foreach(row => {
  val key = row.getAs[Array[Byte]]("key")
  val value = row.getAs[Array[Byte]]("value")
  println(scala.io.Source.fromBytes(key, "UTF-8").mkString)
  println(scala.io.Source.fromBytes(value, "UTF-8").mkString)
})
The problem was with my usage of selectExpr. It does not do an in-place transformation; it returns the transformed data.
Fix:
val df1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
display(df1)

Writing a sorted DataFrame to a Kafka topic (rows in sorted order) in Spark Structured Streaming using Scala

While displaying the sorted results on the console, the rows appear in the expected sort order, but when I push those results to a Kafka topic the sort order is lost.
def main(args: Array[String]) = {
  // Spark config and Kafka config
  // load method
  val Raw_df = readStream(sparkSession, inputtopic)

  // converting the Kafka messages that were read into JSON format
  val df_messages = Raw_df.selectExpr("CAST(value AS STRING)")
    .withColumn("data", from_json($"value", my_schema))
    .select("data.*")

  val win = window($"date_column", "5 minutes")
  val modified_df = df_messages.withWatermark("date_column", "3 minutes")
    .groupBy(win, $"All_colums", $"date_column")
    .count()
    .orderBy(asc("date_column"), asc("column_5"))
  val finalcol = modified_df.drop("count").drop("window")

  // mapping all columns and converting them to JSON messages
  val finalcolonames = my_schema.fields.map(z => z.name)
  val dataset_Json = finalcol.withColumn("value", to_json(struct(finalcolonames.map(y => col(y)): _*)))
    .select($"value")

  //val query = writeToKafkaStremoutput(dataset_Json, outputtopic, checkpointlocation)
  val query = writeToConsole(order)
  (query)
}
// below method writes data to a Kafka topic
def writeToKafkaStremoutput(dataFrame: DataFrame, Config: Config, topic: String, checkpointlocation: String) = {
  dataFrame
    .selectExpr("CAST(value AS STRING)")
    .writeStream
    .format("kafka")
    .trigger(Trigger.ProcessingTime("1 second"))
    .option("topic", topic)
    .option("kafka.bootstrap.servers", "kafka.bootstrap_servers")
    .option("checkpointLocation", checkpointlocation)
    .outputMode(OutputMode.Complete())
    .start()
}
// console output for testing
// below method writes data to the console
def writeToConsole(dataFrame: DataFrame) = {
  import org.apache.spark.sql.streaming.Trigger
  val query = dataFrame
    .writeStream
    .format("console")
    .option("numRows", 300)
    //.trigger(Trigger.ProcessingTime("20 second"))
    .outputMode(OutputMode.Complete())
    .option("truncate", false)
    .start()
  query
}
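One thing to keep in mind: Kafka only preserves ordering within a single partition, and Spark writes each task's rows independently, so a sorted micro-batch that fans out over several tasks and Kafka partitions loses its order. A sketch of one way to keep the order, at the cost of throughput, reusing the question's dataset_Json, outputtopic and checkpointlocation; bootstrapServers is an assumed variable:
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.streaming.OutputMode

val singlePartitionQuery = dataset_Json
  .coalesce(1)                          // one writer task, so rows are sent in sorted order
  .withColumn("key", lit("ordered"))    // constant key -> all records land in one Kafka partition
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)   // assumed variable
  .option("topic", outputtopic)
  .option("checkpointLocation", checkpointlocation)
  .outputMode(OutputMode.Complete())
  .start()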