Spark Dataframe write to kafka topic in avro format? - scala

I have a Dataframe in Spark that looks like
eventDF
Sno|UserID|TypeExp
1|JAS123|MOVIE
2|ASP123|GAMES
3|JAS123|CLOTHING
4|DPS123|MOVIE
5|DPS123|CLOTHING
6|ASP123|MEDICAL
7|JAS123|OTH
8|POQ133|MEDICAL
.......
10000|DPS123|OTH
I need to write it to a Kafka topic in Avro format.
Currently I am able to write it to Kafka as JSON using the following code:
val kafkaUserDF: DataFrame = eventDF.select(to_json(struct(eventDF.columns.map(column):_*)).alias("value"))
kafkaUserDF.selectExpr("CAST(value AS STRING)").write.format("kafka")
.option("kafka.bootstrap.servers", "Host:port")
.option("topic", "eventdf")
.save()
Now I want to write the same data to the Kafka topic in Avro format.

Spark >= 2.4:
You can use the to_avro function from the spark-avro library (in Spark 3.x it is imported from org.apache.spark.sql.avro.functions._ instead).
import org.apache.spark.sql.avro._
eventDF.select(
to_avro(struct(eventDF.columns.map(column):_*)).alias("value")
)
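Combined with the Kafka writer from the question, a minimal sketch could look like the following. The names AvroKafkaWriter, toAvroValue and writeToKafka are made up for illustration, and the code assumes that the spark-avro and spark-sql-kafka-0-10 modules are on the classpath.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.avro._ // Spark 2.4; on Spark 3.x use org.apache.spark.sql.avro.functions._
import org.apache.spark.sql.functions.{col, struct}

// Hypothetical helper object, not part of Spark
object AvroKafkaWriter {
  // Pack all columns into one struct and serialize it as Avro binary
  def toAvroValue(df: DataFrame): DataFrame =
    df.select(to_avro(struct(df.columns.map(col): _*)).alias("value"))

  // Batch write; the Kafka sink reads the payload from the "value" column
  def writeToKafka(df: DataFrame, bootstrapServers: String, topic: String): Unit =
    toAvroValue(df).write
      .format("kafka")
      .option("kafka.bootstrap.servers", bootstrapServers)
      .option("topic", topic)
      .save()
}
```

Note that to_avro emits plain Avro binary without the Confluent Schema Registry prefix, so consumers have to know the schema by other means.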
Spark < 2.4
You have to do the encoding yourself:
Create a function that writes a serialized Avro record to a ByteArrayOutputStream and returns the resulting bytes. A naive implementation (it supports only flat objects) could look like this (adapted from Kafka Avro Scala Example by Sushil Kumar Singh):
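import java.io.ByteArrayOutputStream
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.avro.io.{BinaryEncoder, EncoderFactory}
import org.apache.avro.specific.SpecificDatumWriter
import org.apache.spark.sql.Row
def encode(schema: org.apache.avro.Schema)(row: Row): Array[Byte] = {
  // Copy every field of the Row into an Avro record (flat schemas only)
  val gr: GenericRecord = new GenericData.Record(schema)
  row.schema.fieldNames.foreach(name => gr.put(name, row.getAs[AnyRef](name)))
  val writer = new SpecificDatumWriter[GenericRecord](schema)
  val out = new ByteArrayOutputStream()
  val encoder: BinaryEncoder = EncoderFactory.get().binaryEncoder(out, null)
  writer.write(gr, encoder)
  encoder.flush()
  out.close()
  out.toByteArray()
}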
Convert it to udf:
import org.apache.spark.sql.functions.udf
val schema: org.apache.avro.Schema = ??? // supply the Avro schema that matches eventDF
val encodeUDF = udf(encode(schema) _)
Use it as a drop-in replacement for to_json:
eventDF.select(
encodeUDF(struct(eventDF.columns.map(column):_*)).alias("value")
)
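To check that bytes produced this way are valid Avro, they can be decoded again with a GenericDatumReader. The sketch below is independent of Spark. The Event schema is an assumption that mirrors the three columns of eventDF, and AvroRoundTrip is a hypothetical name.

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

object AvroRoundTrip {
  // Assumed schema: one field per eventDF column
  val schema: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"Event","fields":[
      |{"name":"Sno","type":"int"},
      |{"name":"UserID","type":"string"},
      |{"name":"TypeExp","type":"string"}]}""".stripMargin)

  // Serialize one record to raw Avro binary (no container file header)
  def encode(record: GenericRecord): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
    encoder.flush()
    out.toByteArray
  }

  // Decode raw Avro binary; the reader must use the writer's schema
  def decode(bytes: Array[Byte]): GenericRecord = {
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    new GenericDatumReader[GenericRecord](schema).read(null, decoder)
  }
}
```

Because raw Avro binary carries no schema, decoding only works when the reader uses the same (or a compatible) schema as the writer.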

Related

How to add Header to Avro Kafka Message

We are using Avro Datum Reader and Datum Writer to build Kafka messages in Scala.
Code :
def AvroKafkaMessage(schemaPath : String, dataPath: String): Array[Byte] =
{
val schema = Source.fromFile(schemaPath).mkString
val schemaObj = new Schema.Parser().parse(schema)
val reader= new GenericDatumReader[GenericRecord](schemaObj)
val dataFile = new File(dataPath)
val dataFileReader = new DataFileReader[GenericRecord](dataFile, reader)
val datum = dataFileReader.next()
val writer = new SpecificDatumWriter[GenericRecord](schemaObj)
val out = new ByteArrayOutputStream()
val encoder : BinaryEncoder= EncoderFactory.get().binaryEncoder(out, null)
writer.write(datum,encoder)
encoder.flush()
out.close()
out.toByteArray()
}
Since there would be multiple events per Kafka topic, we need to add a header to the Avro messages for unit testing.
How can we add headers to the Avro messages and produce them to Kafka?
Spark DataFrames need their own column for Kafka headers. It must be named headers and hold an array of key/value pairs, i.e. Array[(String, Array[Byte])] (the Kafka sink supports this since Spark 3.0). Avro doesn't particularly matter here; the function you show returns a byte array, so put that into the value column of the DataFrame you wish to write to Kafka.
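If you have an existing Avro file you want to produce to Kafka, read it with Spark's Avro data source (spark.read.format("avro")) and re-encode the rows with to_avro; from_avro goes the other way and decodes an Avro binary column.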
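As a rough sketch of the headers column, the helper below attaches one key/value header to every row. KafkaHeaders and withHeader are hypothetical names, and the code assumes Spark 3.0 or later, where the Kafka sink picks up an optional headers column.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, lit, struct}

// Hypothetical helper object, not part of Spark
object KafkaHeaders {
  // Adds a "headers" column of type array<struct<key: string, value: binary>>
  def withHeader(df: DataFrame, key: String, value: String): DataFrame =
    df.withColumn(
      "headers",
      array(struct(lit(key).as("key"), lit(value).cast("binary").as("value"))))
}

// Possible usage, assuming avroDF already has a binary "value" column:
//   KafkaHeaders.withHeader(avroDF, "eventType", "MOVIE")
//     .write.format("kafka")
//     .option("kafka.bootstrap.servers", "Host:port")
//     .option("topic", "eventdf")
//     .save()
```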

write into kafka topic using spark and scala

I am reading data from Kafka topic and write back the data received into another Kafka topic.
Below is my code ,
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter
//loading data from kafka
val data = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "*******:9092")
.option("subscribe", "PARAMTABLE")
.option("startingOffsets", "latest")
.load()
//Extracting value from Json
val schema = new StructType().add("PARAM_INSTANCE_ID",IntegerType).add("ENTITY_ID",IntegerType).add("PARAM_NAME",StringType).add("VALUE",StringType)
val df1 = data.selectExpr("CAST(value AS STRING)")
val dataDF = df1.select(from_json(col("value"), schema).as("data")).select("data.*")
//Insert into another Kafka topic
val topic = "SparkParamValues"
val brokers = "********:9092"
val writer = new KafkaSink(topic, brokers)
val query = dataDF.writeStream
.foreach(writer)
.outputMode("update")
.start().awaitTermination()
I am getting the below error,
<Console>:47:error :not found: type KafkaSink
val writer = new KafkaSink(topic, brokers)
I am very new to Spark. Could someone suggest how to resolve this, or verify whether the above code is correct? Thanks in advance.
In Spark Structured Streaming, you can write to a Kafka topic after reading from another topic either with the existing DataStreamWriter for Kafka or with your own sink that extends the ForeachWriter class.
Without using custom sink:
You can use the code below to write a DataFrame to Kafka, assuming df is the DataFrame produced by reading from the Kafka topic.
The DataFrame must have at least one column named value. If you have multiple columns, merge them into a single column and name it value. If no key column is specified, the key will be null in the destination topic.
df.select("key", "value")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "<topicName>")
.start()
.awaitTermination()
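Applied to the dataDF from the question, which has several columns and no value column, this could look like the sketch below. It serializes each row to JSON and uses that as the value. The names JsonKafkaWriter, toKafkaValue and start, as well as the checkpointDir parameter, are assumptions; a streaming query that writes to Kafka needs a checkpoint location.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, struct, to_json}
import org.apache.spark.sql.streaming.StreamingQuery

// Hypothetical helper object, not part of Spark
object JsonKafkaWriter {
  // Merge all columns into a single JSON string column named "value"
  def toKafkaValue(df: DataFrame): DataFrame =
    df.select(to_json(struct(df.columns.map(col): _*)).alias("value"))

  // Start a streaming write to Kafka; the caller decides when to await termination
  def start(df: DataFrame, brokers: String, topic: String, checkpointDir: String): StreamingQuery =
    toKafkaValue(df).writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", brokers)
      .option("topic", topic)
      .option("checkpointLocation", checkpointDir)
      .start()
}
```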
Using custom sink:
If you want to implement your own Kafka sink, create a class that extends ForeachWriter, override its three methods, and pass an instance of the class to the foreach() method.
// Extend ForeachWriter with an anonymous class
df.writeStream.foreach(new ForeachWriter[Row] {
  // If you are writing a Dataset[String], use new ForeachWriter[String] and process(record: String)
  def open(partitionId: Long, version: Long): Boolean = {
    // open the connection; return true to process this partition
    true
  }
  def process(record: Row): Unit = {
    // write the row to the connection
  }
  def close(errorOrNull: Throwable): Unit = {
    // close the connection
  }
}).start()
You can check this Databricks notebook for a full implementation (scroll down to the code under the Kafka Sink heading); I assume that is the page your code comes from. To solve the issue, make sure the KafkaSink class is available to your Spark code: put the Spark code file and the class file in the same package, or, if you are running in spark-shell, paste the KafkaSink class before pasting the Spark code.
Read the Structured Streaming Kafka integration guide to explore more.
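For reference, a KafkaSink class matching the constructor used in the question, new KafkaSink(topic, brokers), might be sketched as follows. This is an illustration under assumptions, not the notebook's code: it sends each row as a comma-separated string and relies on the kafka-clients library being on the classpath.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{ForeachWriter, Row}

// Sketch of a custom sink; one producer is created per partition and epoch
class KafkaSink(topic: String, brokers: String) extends ForeachWriter[Row] {
  // Created in open() on the executor, so it is never serialized
  @transient private var producer: KafkaProducer[String, String] = _

  override def open(partitionId: Long, epochId: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", brokers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true // process every partition
  }

  // Assumed payload format: the row's fields joined by commas
  override def process(row: Row): Unit =
    producer.send(new ProducerRecord[String, String](topic, row.mkString(",")))

  override def close(errorOrNull: Throwable): Unit =
    if (producer != null) producer.close()
}
```

With this class pasted before the streaming code, dataDF.writeStream.foreach(new KafkaSink(topic, brokers)) should compile.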

Unable to find encoder for type stored in a Dataset for streaming mongo db data through Kafka

I want to tail the Mongo oplog and stream it through Kafka, so I found the Debezium Kafka CDC connector, which tails the Mongo oplog using its built-in serialisation technique.
The schema registry uses the converters below for serialization:
'key.converter=io.confluent.connect.avro.AvroConverter' and
'value.converter=io.confluent.connect.avro.AvroConverter'
Below are the library dependencies I'm using in the project
libraryDependencies += "io.confluent" % "kafka-avro-serializer" % "3.1.2"
libraryDependencies += "org.apache.kafka" % "kafka-streams" % "0.10.2.0"
Below is the streaming code which deserialize Avro data
import org.apache.spark.sql.{Dataset, SparkSession}
import io.confluent.kafka.schemaregistry.client.rest.RestService
import io.confluent.kafka.serializers.KafkaAvroDeserializer
import org.apache.avro.Schema
import scala.collection.JavaConverters._
object KafkaStream{
def main(args: Array[String]): Unit = {
val sparkSession = SparkSession
.builder
.master("local")
.appName("kafka")
.getOrCreate()
sparkSession.sparkContext.setLogLevel("ERROR")
import sparkSession.implicits._
case class DeserializedFromKafkaRecord(key: String, value: String)
val schemaRegistryURL = "http://127.0.0.1:8081"
val topicName = "productCollection.inventory.Product"
val subjectValueName = topicName + "-value"
//create RestService object
val restService = new RestService(schemaRegistryURL)
//.getLatestVersion returns io.confluent.kafka.schemaregistry.client.rest.entities.Schema object.
val valueRestResponseSchema = restService.getLatestVersion(subjectValueName)
//Use Avro parsing classes to get Avro Schema
val parser = new Schema.Parser
val topicValueAvroSchema: Schema = parser.parse(valueRestResponseSchema.getSchema)
//key schema is typically just string but you can do the same process for the key as the value
val keySchemaString = "\"string\""
val keySchema = parser.parse(keySchemaString)
//Create a map with the Schema registry url.
//This is the only Required configuration for Confluent's KafkaAvroDeserializer.
val props = Map("schema.registry.url" -> schemaRegistryURL)
//Declare SerDe vars before using Spark structured streaming map. Avoids non serializable class exception.
var keyDeserializer: KafkaAvroDeserializer = null
var valueDeserializer: KafkaAvroDeserializer = null
//Create structured streaming DF to read from the topic.
val rawTopicMessageDF = sparkSession.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "127.0.0.1:9092")
.option("subscribe", topicName)
.option("startingOffsets", "earliest")
.option("maxOffsetsPerTrigger", 20) //remove for prod
.load()
rawTopicMessageDF.printSchema()
//instantiate the SerDe classes if not already, then deserialize!
val deserializedTopicMessageDS = rawTopicMessageDF.map{
row =>
if (keyDeserializer == null) {
keyDeserializer = new KafkaAvroDeserializer
keyDeserializer.configure(props.asJava, true) //isKey = true
}
if (valueDeserializer == null) {
valueDeserializer = new KafkaAvroDeserializer
valueDeserializer.configure(props.asJava, false) //isKey = false
}
//Pass the Avro schema.
val deserializedKeyString = keyDeserializer.deserialize(topicName, row.getAs[Array[Byte]]("key"), keySchema).toString //topic name is actually unused in the source code, just required by the signature. Weird right?
val deserializedValueJsonString = valueDeserializer.deserialize(topicName, row.getAs[Array[Byte]]("value"), topicValueAvroSchema).toString
DeserializedFromKafkaRecord(deserializedKeyString, deserializedValueJsonString)
}
val deserializedDSOutputStream = deserializedTopicMessageDS.writeStream
.outputMode("append")
.format("console")
.option("truncate", false)
.start()
The Kafka consumer is running fine and I can see the data tailing from the oplog; however, when I run the above code I get the errors below:
Error:(70, 59) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val deserializedTopicMessageDS = rawTopicMessageDF.map{
and
Error:(70, 59) not enough arguments for method map: (implicit evidence$7: org.apache.spark.sql.Encoder[DeserializedFromKafkaRecord])org.apache.spark.sql.Dataset[DeserializedFromKafkaRecord].
Unspecified value parameter evidence$7.
val deserializedTopicMessageDS = rawTopicMessageDF.map{
Please suggest what I'm missing here.
Thanks in advance.
Just declare your case class DeserializedFromKafkaRecord outside of the main method.
As far as I can tell, when the case class is defined inside main, Spark cannot derive the implicit encoder for it: the derivation in spark.implicits._ needs a TypeTag for the class, and the compiler cannot supply one for a class that is local to a method.
The problem can be reproduced with a simpler example (without Kafka) :
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
object SimpleTest {
// declare CaseClass outside of main method
case class CaseClass(value: Int)
def main(args: Array[String]): Unit = {
// when case class is declared here instead
// of outside main, the program does not compile
// case class CaseClass(value: Int)
val sparkSession = SparkSession
.builder
.master("local")
.appName("simpletest")
.getOrCreate()
import sparkSession.implicits._
val df: DataFrame = sparkSession.sparkContext.parallelize(1 to 10).toDF()
val ds: Dataset[CaseClass] = df.map { row =>
CaseClass(row.getInt(0))
}
ds.show()
}
}

How to define schema of streaming dataset dynamically to write to csv?

I have a streaming dataset, reading from kafka and trying to write to CSV
case class Event(map: Map[String,String])
def decodeEvent(arrByte: Array[Byte]): Event = ...//some implementation
val eventDataset: Dataset[Event] = spark
.readStream
.format("kafka")
.load()
.select("value")
.as[Array[Byte]]
.map(decodeEvent)
Event holds Map[String,String] inside and to write to CSV I'll need some schema.
Let's say all the fields are of type String and so I tried the example from spark repo
val columns = List("year","month","date","topic","field1","field2")
// Prepare schema programmatically (StructType is immutable, so build it in one go)
val schema = StructType(columns.map(field => StructField(field, StringType)))
val rowRdd = eventDataset.rdd.map { event =>
  Row.fromSeq(columns.map(c => event.map.getOrElse(c, "")))
}
val df = spark.sqlContext.createDataFrame(rowRdd, schema)
This gives error at runtime on line "eventDataset.rdd":
Caused by: org.apache.spark.sql.AnalysisException: Queries with
streaming sources must be executed with writeStream.start();;
The following doesn't work either, because .map yields a List[String], not a tuple:
eventDataset.map(event => columns.map(c => event.map.getOrElse(c, "")))
.toDF(columns:_*)
Is there a way to achieve this with programmatic schema and structured streaming datasets?
I'd use a much simpler approach:
import org.apache.spark.sql.functions._
eventDataset.select(columns.map(
c => coalesce($"map".getItem(c), lit("")).alias(c)
): _*).writeStream.format("csv").start(path)
but if you want something closer to your current solution, skip the RDD conversion and map the Dataset directly, passing a Row encoder explicitly:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
eventDataset.map(event =>
  Row.fromSeq(columns.map(c => event.map.getOrElse(c, "")))
)(RowEncoder(schema)).writeStream.format("csv").start(path)
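A slightly fuller sketch of the first variant is shown below. The names EventCsvWriter, flatten and writeCsv, and the outputDir and checkpointDir parameters, are assumptions for illustration; the file sink needs a checkpoint location when the query is started.

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{coalesce, col, lit}
import org.apache.spark.sql.streaming.StreamingQuery

// Hypothetical helper object, not part of Spark
object EventCsvWriter {
  // One string column per requested key; missing keys become empty strings
  def flatten(events: Dataset[_], columns: Seq[String]): DataFrame =
    events.select(columns.map(c => coalesce(col("map").getItem(c), lit("")).alias(c)): _*)

  // Works on a streaming Dataset whose "map" column is a Map[String, String]
  def writeCsv(events: Dataset[_], columns: Seq[String],
               outputDir: String, checkpointDir: String): StreamingQuery =
    flatten(events, columns).writeStream
      .format("csv")
      .option("checkpointLocation", checkpointDir)
      .start(outputDir)
}
```

Since the column list is an ordinary Seq[String], it can be computed at runtime, which is what makes the schema dynamic here.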

Saving DataStream data into MongoDB / converting DS to DF

I am able to save a DataFrame to MongoDB, but my Spark Streaming program produces a DStream (kafkaStream), and I am able neither to save it to MongoDB nor to convert it to a DataFrame. Is there any library or method to do this? Any input is highly appreciated.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka.KafkaUtils
object KafkaSparkStream {
def main(args: Array[String]){
val conf = new SparkConf().setMaster("local[*]").setAppName("KafkaReceiver")
val ssc = new StreamingContext(conf, Seconds(10))
val kafkaStream = KafkaUtils.createStream(ssc,
"localhost:2181","spark-streaming-consumer-group", Map("topic" -> 25))
kafkaStream.print()
ssc.start()
ssc.awaitTermination()
}
}
Save a DF to mongodb - SUCCESS
val mongoDbFormat = "com.stratio.datasource.mongodb"
val mongoDbDatabase = "mongodatabase"
val mongoDbCollection = "mongodf"
val mongoDbOptions = Map(
MongodbConfig.Host -> "localhost:27017",
MongodbConfig.Database -> mongoDbDatabase,
MongodbConfig.Collection -> mongoDbCollection
)
//with DataFrame methods
dataFrame.write
.format(mongoDbFormat)
.mode(SaveMode.Append)
.options(mongoDbOptions)
.save()
Access the underlying RDD of the DStream using foreachRDD, transform it into a DataFrame, and use your DataFrame write function on it.
The easiest way to transform an RDD into a DataFrame is to first map the data to a schema, which in Scala is represented by a case class:
case class Element(...)
val elementDStream = kafkaDStream.map(entry => Element(entry, ...))
elementDStream.foreachRDD{rdd =>
val df = rdd.toDF
df.write(...)
}
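Filled in for the (key, value) pairs that KafkaUtils.createStream produces, the outline above might look like this sketch. The Element fields and the names DStreamToDataFrame, toElement and saveEachBatch are assumptions; the save function is whatever DataFrame writer you already have.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.streaming.dstream.DStream

// Top-level case class so that Spark can derive an encoder for it
case class Element(key: String, value: String)

// Hypothetical helper object, not part of Spark
object DStreamToDataFrame {
  def toElement(kv: (String, String)): Element = Element(kv._1, kv._2)

  // Converts every non-empty micro-batch to a DataFrame and hands it to `save`
  def saveEachBatch(stream: DStream[(String, String)], spark: SparkSession)
                   (save: DataFrame => Unit): Unit = {
    import spark.implicits._ // provides rdd.toDF()
    stream.map(toElement).foreachRDD { rdd =>
      if (!rdd.isEmpty()) save(rdd.toDF())
    }
  }
}

// Possible usage with the writer from the question (call before ssc.start()):
//   DStreamToDataFrame.saveEachBatch(kafkaStream, spark) { df =>
//     df.write.format(mongoDbFormat).mode(SaveMode.Append).options(mongoDbOptions).save()
//   }
```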
Also, watch out for Spark 2.0 where this process will completely change with the introduction of Structured Streaming, where a MongoDB connection will become a sink.