Spark 3.2.0 Structured Streaming save data to Kafka with Confluent Schema Registry - Scala

Is there an easy way to save a Spark Structured Streaming dataframe to Kafka with the Confluent Schema Registry? Spark version is 3.2.0, Scala 2.12.
I managed to read data from Kafka with the Confluent Schema Registry with a bit of ugly code:
import io.confluent.kafka.schemaregistry.client.{CachedSchemaRegistryClient, SchemaRegistryClient}
import io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer
import org.apache.avro.generic.GenericRecord

object DeserializerWrapper {
  // schemaRegistry holds the schema registry URL
  val schemaRegistryClient = new CachedSchemaRegistryClient(schemaRegistry, 128)
  val kafkaAvroDeserializer = new AvroDeserializer(schemaRegistryClient)
  val deserializer = kafkaAvroDeserializer
}

class AvroDeserializer extends AbstractKafkaAvroDeserializer {
  def this(client: SchemaRegistryClient) {
    this()
    this.schemaRegistry = client
  }

  override def deserialize(bytes: Array[Byte]): String = {
    val genericRecord = super.deserialize(bytes).asInstanceOf[GenericRecord]
    genericRecord.toString
  }
}

spark.udf.register("deserialize", (bytes: Array[Byte]) =>
  DeserializerWrapper.deserializer.deserialize(bytes))
Now I would like to write the data to another Kafka topic - is there a simple way?

You'd need similarly ugly code that applies a serializer UDF over a struct column (or a primitive type).
There are libraries that make it less ugly, e.g. ABRiS - https://github.com/AbsaOSS/ABRiS
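For example, a minimal sketch with ABRiS (assuming the ABRiS 6.x-style AbrisConfig API, that the value schema for the topic is already registered, and that the registry runs at http://localhost:8081; the dataframe df, topic name, broker address, and checkpoint path are illustrative):
import org.apache.spark.sql.functions.{col, struct}
import za.co.absa.abris.avro.functions.to_avro
import za.co.absa.abris.config.AbrisConfig

// Look up the latest registered value schema for the target topic (topic-name strategy)
val toAvroConfig = AbrisConfig
  .toConfluentAvro
  .downloadSchemaByLatestVersion
  .andTopicNameStrategy("output-topic")
  .usingSchemaRegistry("http://localhost:8081")

// Pack all columns into one struct and serialize it to Confluent-framed Avro bytes
val avroDF = df.select(to_avro(struct(df.columns.map(col): _*), toAvroConfig).as("value"))

avroDF.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "output-topic")
  .option("checkpointLocation", "/tmp/abris-checkpoint")
  .start()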

Related

How to add Header to Avro Kafka Message

We are using Avro Datum Reader and Datum Writer to build Kafka messages in Scala.
Code:
import java.io.{ByteArrayOutputStream, File}
import scala.io.Source
import org.apache.avro.Schema
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.{BinaryEncoder, EncoderFactory}
import org.apache.avro.specific.SpecificDatumWriter

def AvroKafkaMessage(schemaPath: String, dataPath: String): Array[Byte] = {
  val schema = Source.fromFile(schemaPath).mkString
  val schemaObj = new Schema.Parser().parse(schema)
  val reader = new GenericDatumReader[GenericRecord](schemaObj)
  val dataFile = new File(dataPath)
  val dataFileReader = new DataFileReader[GenericRecord](dataFile, reader)
  val datum = dataFileReader.next()
  val writer = new SpecificDatumWriter[GenericRecord](schemaObj)
  val out = new ByteArrayOutputStream()
  val encoder: BinaryEncoder = EncoderFactory.get().binaryEncoder(out, null)
  writer.write(datum, encoder)
  encoder.flush()
  out.close()
  out.toByteArray()
}
Since there would be multiple events per Kafka topic, we need to add a header to the Avro messages for unit testing.
How do we add headers to the Avro messages and produce the Kafka messages?
Spark dataframes need their own column for Kafka headers, and it must be in the specific format Array[(String, Array[Byte])]. Avro doesn't particularly matter here; your function returns a byte array, so add that as a column of the dataframe you wish to write to Kafka (a sketch is shown below).
If you have an existing Avro file you want to produce to Kafka, use Spark's existing from_avro function.
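A minimal sketch of attaching a header, assuming a Spark 3.x Kafka sink that picks up an optional headers column; the eventDF dataframe, header key/value, broker address, and topic name are illustrative, and the value column is assumed to already hold the serialized Avro bytes:
import org.apache.spark.sql.functions._

// headers must be ARRAY<STRUCT<key: STRING, value: BINARY>>, i.e. Array[(String, Array[Byte])]
val withHeaders = eventDF
  .withColumn("headers", array(
    struct(lit("eventType").as("key"), lit("employee-created".getBytes("UTF-8")).as("value"))))

withHeaders
  .select("value", "headers")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "unit-test-topic")
  .save()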

How to deserialize an Avro response from a DataStream - Scala + Apache Flink

I am getting an Avro response from a Kafka topic in Confluent and I am facing issues when I want to deserialize the response. I don't understand the syntax for defining the Avro deserializer and using it in my Kafka source while reading.
Sharing the approach I am currently using.
I have a topic in Confluent named employee which produces a message every 10 seconds, and each message is serialized via the Avro schema registry in Confluent.
I am trying to read those messages in my Scala program. I was able to print the serialized messages, but I am not able to deserialize them.
import org.apache.flink.streaming.api.scala._
import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.connector.kafka.source.KafkaSource
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer
import org.apache.flink.formats.avro.AvroDeserializationSchema
import org.apache.avro.generic.GenericData
import org.apache.avro.generic.GenericRecord
import java.time.Duration

case class emp(
  name: String,
  age: Int
)

object Main {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val schemaRegistryUrl = "http://localhost:8081"

    val source = KafkaSource.builder[String]
      .setBootstrapServers("localhost:9092")
      .setTopics("employee")
      .setGroupId("my-group")
      .setStartingOffsets(OffsetsInitializer.earliest)
      .setValueOnlyDeserializer(new SimpleStringSchema())
      .build

    val streamEnv: DataStream[String] =
      env.fromSource(source, WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(20)), "Kafka Source")

    streamEnv.print()
    env.execute("Example")
  }
}
I tried the approach of defining the Avro deserializer in the Kafka source while reading:
.setValueOnlyDeserializer(new AvroDeserializationSchema[emp](classOf[emp]))
but had no luck with that approach either.
Rather than an AvroDeserializationSchema, you need to use a ConfluentRegistryAvroDeserializationSchema. The standard Avro deserializer doesn't know what to do with the magic byte and schema ID that the Confluent serializer prepends to each message.
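A minimal sketch using the generic-record variant (assuming the flink-avro-confluent-registry dependency is on the classpath; the inline schema mirrors the emp case class from the question, the registry URL is the one from your code, and env is the question's StreamExecutionEnvironment):
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.connector.kafka.source.KafkaSource
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer
import org.apache.flink.formats.avro.registry.confluent.ConfluentRegistryAvroDeserializationSchema
import org.apache.flink.formats.avro.typeutils.GenericRecordAvroTypeInfo
import org.apache.flink.streaming.api.scala._

// Value schema of the employee topic (illustrative; copy the real one from the registry)
val empSchema: Schema = new Schema.Parser().parse(
  """{"type":"record","name":"emp","fields":[
    |  {"name":"name","type":"string"},
    |  {"name":"age","type":"int"}
    |]}""".stripMargin)

// Flink needs Avro-aware type information to ship GenericRecord between operators
implicit val genericRecordInfo: TypeInformation[GenericRecord] =
  new GenericRecordAvroTypeInfo(empSchema)

val source = KafkaSource.builder[GenericRecord]
  .setBootstrapServers("localhost:9092")
  .setTopics("employee")
  .setGroupId("my-group")
  .setStartingOffsets(OffsetsInitializer.earliest)
  .setValueOnlyDeserializer(
    ConfluentRegistryAvroDeserializationSchema.forGeneric(empSchema, "http://localhost:8081"))
  .build

val employees: DataStream[GenericRecord] =
  env.fromSource(source, WatermarkStrategy.noWatermarks[GenericRecord](), "Kafka Source")

employees.print()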

Write into a Kafka topic using Spark and Scala

I am reading data from a Kafka topic and writing the received data back into another Kafka topic.
Below is my code:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter

// loading data from kafka
val data = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "*******:9092")
  .option("subscribe", "PARAMTABLE")
  .option("startingOffsets", "latest")
  .load()

// Extracting value from Json
val schema = new StructType()
  .add("PARAM_INSTANCE_ID", IntegerType)
  .add("ENTITY_ID", IntegerType)
  .add("PARAM_NAME", StringType)
  .add("VALUE", StringType)
val df1 = data.selectExpr("CAST(value AS STRING)")
val dataDF = df1.select(from_json(col("value"), schema).as("data")).select("data.*")

// Insert into another Kafka topic
val topic = "SparkParamValues"
val brokers = "********:9092"
val writer = new KafkaSink(topic, brokers)

val query = dataDF.writeStream
  .foreach(writer)
  .outputMode("update")
  .start()
  .awaitTermination()
I am getting the below error:
<console>:47: error: not found: type KafkaSink
val writer = new KafkaSink(topic, brokers)
I am very new to Spark. Could someone suggest how to resolve this, or verify whether the above code is correct? Thanks in advance.
In Spark Structured Streaming, you can write to a Kafka topic after reading from another topic either by using the existing DataStreamWriter support for Kafka, or by creating your own sink by extending the ForeachWriter class.
Without using a custom sink:
You can use the code below to write a dataframe to Kafka, assuming df is the dataframe generated by reading from a Kafka topic.
The dataframe should have at least one column named value. If you have multiple columns, merge them into one column named value (see the sketch after the code below). If no key column is specified, the key will be null in the destination topic.
df.select("key", "value")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "<topicName>")
.start()
.awaitTermination()
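For the dataDF in the question, which has several columns, one way to merge them into a single value column is to re-serialize each row as JSON. A sketch reusing the question's column names, topic, and (masked) broker address; the checkpoint path is illustrative:
import org.apache.spark.sql.functions._

val kafkaReady = dataDF.select(
  to_json(struct(col("PARAM_INSTANCE_ID"), col("ENTITY_ID"),
                 col("PARAM_NAME"), col("VALUE"))).as("value"))

kafkaReady.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "********:9092")
  .option("topic", "SparkParamValues")
  .option("checkpointLocation", "/tmp/spark-param-checkpoint")
  .start()
  .awaitTermination()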
Using a custom sink:
If you want to implement your own Kafka sink, create a class that extends ForeachWriter, override its methods, and pass an instance of that class to the foreach() method.
// Using an anonymous class to extend ForeachWriter
df.writeStream.foreach(new ForeachWriter[Row] {
  // If you are writing a Dataset[String], use new ForeachWriter[String] instead
  def open(partitionId: Long, version: Long): Boolean = {
    // open the connection; return true to process this partition/epoch
    true
  }

  def process(record: Row): Unit = {
    // write the row to the connection
  }

  def close(errorOrNull: Throwable): Unit = {
    // close the connection
  }
}).start()
You can check this Databricks notebook for the implemented code (scroll down and look under the Kafka Sink heading); I think that is the page you are referring to. To fix the error, make sure the KafkaSink class is visible to your Spark code, for example by keeping both files in the same package. If you are running in spark-shell, paste the KafkaSink class before pasting the Spark code. A sketch of such a class is shown below.
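A minimal sketch of such a KafkaSink, assuming kafka-clients is on the classpath and string keys/values; adapt the serializers and the record construction to your data:
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{ForeachWriter, Row}

class KafkaSink(topic: String, servers: String) extends ForeachWriter[Row] {

  private val props = new Properties()
  props.put("bootstrap.servers", servers)
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  private var producer: KafkaProducer[String, String] = _

  override def open(partitionId: Long, epochId: Long): Boolean = {
    // one producer per partition/epoch
    producer = new KafkaProducer[String, String](props)
    true
  }

  override def process(row: Row): Unit =
    // here every row is written as a comma-separated string; adapt as needed
    producer.send(new ProducerRecord[String, String](topic, row.mkString(",")))

  override def close(errorOrNull: Throwable): Unit =
    if (producer != null) producer.close()
}
With this class defined (pasted into spark-shell first, or compiled into the same package), the new KafkaSink(topic, brokers) line from the question resolves.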
Read the Structured Streaming + Kafka integration guide to explore more.

Avro support in Flink - Scala

How do I read Avro from Flink in Scala?
Is it the same for batch/stream/table, i.e. StreamExecutionEnvironment / ExecutionEnvironment / TableEnvironment?
Would it be something like: val custTS: TableSource = new AvroInputFormat("/path/to/file", ...)
Below is the Java Avro implementation reference (connectors), but I can't find a Scala reference anywhere:
AvroInputFormat<User> users = new AvroInputFormat<User>(in, User.class);
DataSet<User> usersDS = env.createInput(users);
You can use Flink's InputFormats, including the AvroInputFormat, from the Java as well as the Scala API:
Streaming & batch: val avroInputStream = env.createInput(new AvroInputFormat[User](in, classOf[User]))
Table API: tableEnv.registerTable("table", avroInputStream.toTable)
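Fleshing that out a little, a minimal batch-mode sketch (assuming an Avro-generated User class and the flink-avro dependency; the file path is illustrative, and depending on your Flink version AvroInputFormat may live in a different package):
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.Path
import org.apache.flink.formats.avro.AvroInputFormat

val env = ExecutionEnvironment.getExecutionEnvironment

// Read Avro records as a DataSet of the generated User class
val in = new Path("/path/to/users.avro")
val users: DataSet[User] = env.createInput(new AvroInputFormat[User](in, classOf[User]))

users.print()
The streaming variant is the same createInput call on a StreamExecutionEnvironment.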

How to write an ElasticsearchSink for Spark Structured Streaming

I'm using Spark Structured Streaming to process high-volume data from a Kafka queue and doing some heavy ML computation, but I need to write the result to Elasticsearch.
I tried using a ForeachWriter but can't get a SparkContext inside it; the other option is probably to do an HTTP POST inside the ForeachWriter.
Right now, I am thinking of writing my own ElasticsearchSink.
Is there any documentation out there on creating a Sink for Spark Structured Streaming?
If you are using Spark 2.2+ and ES 6.x then there is an ES sink out of the box:
df
  .writeStream
  .outputMode(OutputMode.Append())
  .format("org.elasticsearch.spark.sql")
  .option("es.mapping.id", "mappingId")
  .start("index/type") // index/type
If you are using ES 5.x like I was, you need to implement an EsSink and an EsSinkProvider:
EsSinkProvider:
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSinkProvider}
import org.apache.spark.sql.streaming.OutputMode

class EsSinkProvider extends StreamSinkProvider with DataSourceRegister {

  override def createSink(sqlContext: SQLContext,
                          parameters: Map[String, String],
                          partitionColumns: Seq[String],
                          outputMode: OutputMode): Sink = {
    EsSink(sqlContext, parameters, partitionColumns, outputMode)
  }

  override def shortName(): String = "my-es-sink"
}
EsSink:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.CatalystTypeConverters
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.elasticsearch.spark.rdd.EsSpark

case class EsSink(sqlContext: SQLContext,
                  options: Map[String, String],
                  partitionColumns: Seq[String],
                  outputMode: OutputMode) extends Sink {

  override def addBatch(batchId: Long, df: DataFrame): Unit = synchronized {
    val schema = df.schema

    // this ensures that the same query plan will be used
    val rdd: RDD[String] = df.queryExecution.toRdd.mapPartitions { rows =>
      val converter = CatalystTypeConverters.createToScalaConverter(schema)
      rows.map(converter(_).asInstanceOf[Row]).map(_.getAs[String](0))
    }

    // from the org.elasticsearch.spark.rdd library
    EsSpark.saveJsonToEs(rdd, "index/type", Map("es.mapping.id" -> "mappingId"))
  }
}
And then lastly, when writing the stream, use this provider class as the format:
df
  .writeStream
  .queryName("ES-Writer")
  .outputMode(OutputMode.Append())
  .format("path.to.EsSinkProvider")
  .start()
You can take a look at ForeachSink. It shows how to implement a Sink and convert a DataFrame to an RDD (it's very tricky and has a large comment). However, please be aware that the Sink API is still private and immature; it might change in the future.