I changed a line from createStream to createDirectStream because the new library does not support createStream.
I checked this here: https://codewithgowtham.blogspot.com/2022/02/spark-streaming-kafka-cassandra-end-to.html
scala> val lines = KafkaUtils.createDirectStream(ssc, "localhost:2181", "sparkgroup", topicpMap).map(_._2)
<console>:44: error: overloaded method value createDirectStream with alternatives:
[K, V](jssc: org.apache.spark.streaming.api.java.JavaStreamingContext, locationStrategy: org.apache.spark.streaming.kafka010.LocationStrategy, consumerStrategy: org.apache.spark.streaming.kafka010.ConsumerStrategy[K,V], perPartitionConfig: org.apache.spark.streaming.kafka010.PerPartitionConfig)org.apache.spark.streaming.api.java.JavaInputDStream[org.apache.kafka.clients.consumer.ConsumerRecord[K,V]] <and>
[K, V](ssc: org.apache.spark.streaming.StreamingContext, locationStrategy: org.apache.spark.streaming.kafka010.LocationStrategy, consumerStrategy: org.apache.spark.streaming.kafka010.ConsumerStrategy[K,V], perPartitionConfig: org.apache.spark.streaming.kafka010.PerPartitionConfig)org.apache.spark.streaming.dstream.InputDStream[org.apache.kafka.clients.consumer.ConsumerRecord[K,V]]
cannot be applied to (org.apache.spark.streaming.StreamingContext, String, String, scala.collection.immutable.Map[String,Int])
val lines = KafkaUtils.createDirectStream(ssc, "localhost:2181", "sparkgroup", topicpMap).map(_._2)
It's already 2022, so there should be a very specific reason for using legacy Spark Streaming. Use Spark Structured Streaming instead; it is much easier to use than the legacy API. With it, working with Kafka is very simple:
// create the source DataFrame (this is a batch read; use .readStream for a continuous stream)
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.load()
// Decode the payload; how to do this depends heavily on the data format stored in Kafka
val decoded = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
You can use the same APIs for working with both streaming & batch data.
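If you do have to stay on the DStream API, the alternatives listed in the compiler error show the shape that the kafka010 createDirectStream expects: a location strategy and a consumer strategy instead of a ZooKeeper address and a topic map. A minimal sketch, assuming a broker at localhost:9092 and a topic named mytopic (both are placeholders, not taken from the question) and reusing the question's group id:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("DirectStreamSketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // The 0.10 integration talks to the brokers directly, not to ZooKeeper.
    // Broker address and topic name below are assumptions for illustration.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "sparkgroup",
      "auto.offset.reset" -> "latest"
    )
    val topics = Seq("mytopic")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
    )

    // Elements are ConsumerRecord[String, String], so use .value instead of ._2
    val lines = stream.map(record => record.value)
    lines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

In spark-shell the existing ssc can be reused and only the kafkaParams, topics and createDirectStream parts are needed.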
Related
Currently I am using the Abris library to deserialize Confluent Avro messages from Kafka. It works well when the topic contains messages with only one version of the schema. As soon as the topic has data with different schema versions, it starts giving me a "malformed data found" error, which is expected, because when creating the config I pass SchemaManager.PARAM_VALUE_SCHEMA_ID -> "latest".
My question is how to determine the schema ID at run time for each record and then pass it to the Abris config. Here is the sample code:
Spark version: Spark 2.4.0
Scala :2.11.12
Abris:5.0.0
def getTopicSchemaMap(topicNm: String): Map[String, String] = {
Map(
SchemaManager.PARAM_SCHEMA_REGISTRY_TOPIC -> topicNm,
SchemaManager.PARAM_SCHEMA_REGISTRY_URL -> schemaRegUrl,
SchemaManager.PARAM_VALUE_SCHEMA_NAMING_STRATEGY -> "topic.name",
SchemaManager.PARAM_VALUE_SCHEMA_ID -> "latest")
}
val kafkaDataFrameRaw = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", kafkaUrl)
.option("subscribe", topics)
.option("maxOffsetsPerTrigger", maxOffsetsPerTrigger)
.option("startingOffsets", "earliest")
.option("failOnDataLoss", false)
.load()
val df= kafkaDataFrameRaw.select(
from_confluent_avro(col("value"), getTopicSchemaMap(topicNm)) as 'value, col("offset").as("offsets"))
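One way to find the schema ID per record, independent of Abris, is to read it from the message itself: in the Confluent wire format each message starts with a magic byte (0) followed by a 4-byte big-endian schema ID, and only then the Avro payload. The sketch below is plain Scala; the object and method names are made up for illustration and are not part of the Abris API.

```scala
import java.nio.ByteBuffer

object ConfluentHeader {
  // Confluent wire format: byte 0 is the magic byte (0),
  // bytes 1-4 hold the schema ID as a big-endian Int, the rest is the Avro payload.
  def schemaId(message: Array[Byte]): Option[Int] =
    if (message != null && message.length >= 5 && message(0) == 0)
      Some(ByteBuffer.wrap(message, 1, 4).getInt) // ByteBuffer is big-endian by default
    else
      None // too short, or not in Confluent wire format
}
```

In Spark this could be wrapped in a UDF over the value column, so that records can be grouped by schema ID and each group decoded with a config that names that ID rather than "latest". How that config is built depends on the Abris version in use.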
In Kafka I get new topics dynamically, and I have to process them with Spark streaming starting from a specific offset. Is it possible to pass the JSON value from a variable? For example, consider the code below:
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "topic.*")
.option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
.load()
Here I want to update the value of startingOffsets dynamically. I tried to pass the value as a string variable, but it did not work. If I give the same value inline in startingOffsets, it works. How can I use a variable in this scenario?
val start_offset= """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}"""
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "topic.*")
.option("startingOffsets", start_offset)
.load()
java.lang.IllegalArgumentException: Expected e.g. {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got """{"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}"""
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setMaster("local[*]").setAppName("ReadSpecificOffsetFromKafka")
  val spark = SparkSession.builder().config(conf).getOrCreate()
  spark.sparkContext.setLogLevel("error")
  import spark.implicits._
  // Per-partition starting offsets for first_topic; -2 means "earliest"
  val start_offset = """{"first_topic" : {"0" : 15, "1": -2, "2": 6}}"""
  val fromKafka = spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092,localhost:9093")
    .option("subscribe", "first_topic")
    // .option("startingOffsets", "earliest")
    .option("startingOffsets", start_offset)
    .load()
  // Keep only the message value (as text) and the partition it came from
  val selectedValues = fromKafka.selectExpr("cast(value as string)", "cast(partition as integer)")
  selectedValues.writeStream
    .format("console")
    .outputMode("append")
    // .trigger(Trigger.Continuous("3 seconds"))
    .start()
    .awaitTermination()
}
This is working code for reading from specific offsets in Kafka using Spark Structured Streaming and Scala.
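The error message in the question shows that the option value itself contained the three literal quote characters, which suggests the string came from outside the Scala source (for example a file or a command-line argument) with the quotes included. When the offsets are computed at run time, it may be simpler to build the JSON from a Map. A minimal sketch in plain Scala; the helper name is hypothetical:

```scala
object StartingOffsets {
  // Builds the JSON expected by the "startingOffsets" option from
  // topic -> (partition -> offset). In Spark, -2 means earliest and -1 means latest.
  // Topic names are not escaped, since Kafka topic names cannot contain quotes.
  def toJson(offsets: Map[String, Map[Int, Long]]): String =
    offsets.toSeq.sortBy(_._1).map { case (topic, partitions) =>
      val inner = partitions.toSeq.sortBy(_._1)
        .map { case (partition, offset) => "\"" + partition + "\":" + offset }
        .mkString(",")
      "\"" + topic + "\":{" + inner + "}"
    }.mkString("{", ",", "}")
}
```

The result can then be passed as .option("startingOffsets", StartingOffsets.toJson(offsets)), where offsets is whatever map the job computes for the topics it has discovered.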
It looks like your job is checkpointing the Kafka offsets to some persistent storage. Try cleaning those checkpoints and re-running your job.
Also try renaming your job and running it again.
Spark can read the stream via readStream. Try using the offset format shown in the error message to get rid of the error.
spark
.readStream
.format("kafka")
.option("subscribePattern", "topic.*")
I am trying out a simple example to publish data to Kafka and consume it using Spark.
Here is the Producer code:
var kafka_input = spark.sql("""
SELECT CAST(Id AS STRING) as key,
to_json(
named_struct(
'Id', Id,
'Title',Title
)
) as value
FROM offer_data""")
kafka_input.write
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBrokers)
.option("topic", topicName)
.save()
I verified that kafka_input has a JSON string for the value and a number cast to a string for the key.
Here is the consumer code:
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBrokers)
.option("subscribe", topicName)
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
df.take(50)
display(df)
The data I receive on the consumer side is a base64-encoded string.
How do I decode the value in Scala?
Also, this read statement is not flushing these records from the Kafka queue. I assume this is because I am not sending any ack signal back to Kafka. Is that correct? If so, how do I do that?
Try this:
df.foreach(row => {
val key = row.getAs[Array[Byte]]("key")
val value = row.getAs[Array[Byte]]("value")
println(scala.io.Source.fromBytes(key,"UTF-8").mkString)
println(scala.io.Source.fromBytes(value,"UTF-8").mkString)
})
The problem was with my usage of selectExpr. It does not do an in-place transformation; it returns the transformed data.
Fix:
val df1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
display(df1)
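The Kafka source exposes key and value as binary columns, which is presumably why the notebook showed them base64-encoded. If you end up holding the raw bytes or the base64 text outside of selectExpr, both can be decoded with the Java standard library. A small sketch in plain Scala, with hypothetical helper names and assuming the payload is UTF-8 text:

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

object PayloadDecoding {
  // Raw Kafka bytes -> text, assuming the producer wrote UTF-8
  def bytesToString(bytes: Array[Byte]): String =
    new String(bytes, StandardCharsets.UTF_8)

  // Base64 text (as rendered for a binary column) -> original text
  def base64ToString(encoded: String): String =
    bytesToString(Base64.getDecoder.decode(encoded))
}
```

For example, PayloadDecoding.base64ToString("aGVsbG8=") yields hello.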
import spark.implicits._
val ds1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test")
.option("startingOffsets", "latest")
.load()
.as[KafkaMessage]
.select($"value".as[Array[Byte]])
.map(msg=>{
val byteArrayInputStream = new ByteArrayInputStream(msg)
val datumReader:DatumReader[GenericRecord] = new SpecificDatumReader[GenericRecord](messageSchema)
val dataFileReader:DataFileStream[GenericRecord] = new DataFileStream[GenericRecord](byteArrayInputStream, datumReader)
while(dataFileReader.hasNext) {
val userData1: GenericRecord = dataFileReader.next()
userData1.asInstanceOf[org.apache.avro.util.Utf8].toString
}
})
Error:
Error:(49, 9) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
.map(msg=>{
Whenever you apply map or other typed transformations to a Dataset in Structured Streaming, an appropriate encoder must be available for the result type.
For primitive data types, Spark provides implicit encoders:
import spark.implicits._
For other types you need to provide one manually.
So here you can either define an implicit Kryo-based encoder:
import scala.reflect.ClassTag
implicit def kryoEncoder[A](implicit ct: ClassTag[A]) = org.apache.spark.sql.Encoders.kryo[A](ct)
... or you can define your own encoder for the data type produced inside the map function.
I was trying to reproduce the example from Databricks and apply it to the new Kafka connector and Spark Structured Streaming. However, I cannot parse the JSON correctly using the out-of-the-box methods in Spark.
Note: the topic is written into Kafka in JSON format.
val ds1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", IP + ":9092")
.option("zookeeper.connect", IP + ":2181")
.option("subscribe", TOPIC)
.option("startingOffsets", "earliest")
.option("max.poll.records", 10)
.option("failOnDataLoss", false)
.load()
The following code won't work. I believe that's because the column json is a string and does not match the from_json method signature.
val df = ds1.select($"value" cast "string" as "json")
.select(from_json("json") as "data")
.select("data.*")
Any tips?
[UPDATE] Example working:
https://github.com/katsou55/kafka-spark-structured-streaming-example/blob/master/src/main/scala-2.11/Main.scala
First you need to define the schema for your JSON message. For example:
val schema = new StructType()
.add($"id".string)
.add($"name".string)
Now you can pass this schema to the from_json method, as shown below.
val df = ds1.select($"value" cast "string" as "json")
.select(from_json($"json", schema) as "data")
.select("data.*")