How to deserialize an Avro response from a DataStream in Scala + Apache Flink - scala

I am getting an Avro response from a Kafka topic in Confluent and I am facing issues when I try to deserialize it. I do not understand the syntax for defining the Avro deserializer and using it in my Kafka source while reading.
Sharing the approach I am currently taking.
I have a topic in Confluent named employee which produces a message every 10 seconds, and each message is serialized with an Avro schema via the Confluent Schema Registry.
I am trying to read those messages in my Scala program. I was able to print the serialized messages, but I am not able to deserialize them.
import org.apache.flink.streaming.api.scala._
import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.connector.kafka.source.KafkaSource
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer
import org.apache.flink.formats.avro.AvroDeserializationSchema
import org.apache.avro.generic.GenericData
import org.apache.avro.generic.GenericRecord
import java.time.Duration

case class emp(
  name: String,
  age: Int
)

object Main {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val schemaRegistryUrl = "http://localhost:8081"

    // Kafka source that currently reads the Avro-encoded bytes as plain strings
    val source = KafkaSource.builder[String]
      .setBootstrapServers("localhost:9092")
      .setTopics("employee")
      .setGroupId("my-group")
      .setStartingOffsets(OffsetsInitializer.earliest)
      .setValueOnlyDeserializer(new SimpleStringSchema())
      .build

    val streamEnv: DataStream[String] =
      env.fromSource(source, WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(20)), "Kafka Source")

    streamEnv.print()
    env.execute("Example")
  }
}
I tried the approach of defining the Avro deserializer in the Kafka source while reading:
.setValueOnlyDeserializer(new AvroDeserializationSchema[emp](classOf[emp]))
but had no luck with that approach either.

Rather than an AvroDeserializationSchema, you need to use a ConfluentRegistryAvroDeserializationSchema. The standard Avro deserializer doesn't understand what to do with the magic byte that the Confluent serializer includes.
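For example, here is a minimal sketch of wiring that into the KafkaSource from the question. It assumes the flink-avro-confluent-registry dependency is on the classpath and that the inline reader schema matches (or is compatible with) the schema registered for the employee topic in Confluent:

import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.flink.formats.avro.registry.confluent.ConfluentRegistryAvroDeserializationSchema

// Assumed reader schema; adjust the fields to your registered subject.
val employeeSchema = new Schema.Parser().parse(
  """{"type":"record","name":"emp","fields":[
    |  {"name":"name","type":"string"},
    |  {"name":"age","type":"int"}
    |]}""".stripMargin)

val avroSource = KafkaSource.builder[GenericRecord]
  .setBootstrapServers("localhost:9092")
  .setTopics("employee")
  .setGroupId("my-group")
  .setStartingOffsets(OffsetsInitializer.earliest)
  // Registry-aware deserializer: reads the magic byte and schema id, then decodes the payload
  .setValueOnlyDeserializer(
    ConfluentRegistryAvroDeserializationSchema.forGeneric(employeeSchema, schemaRegistryUrl))
  .build

env.fromSource(avroSource, ...) then yields a DataStream[GenericRecord], which you can map onto the emp case class if you prefer typed records.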

Related

Spark 3.2.0 Structured Streaming save data to Kafka with Confluent Schema Registry

Is there an easy way to save a Spark Structured Streaming dataframe to Kafka with the Confluent Schema Registry? Spark version is 3.2.0, Scala 2.12.
I managed to read data from Kafka with the Confluent Schema Registry with a bit of ugly code:
import io.confluent.kafka.schemaregistry.client.{CachedSchemaRegistryClient, SchemaRegistryClient}
import io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer
import org.apache.avro.generic.GenericRecord

object DeserializerWrapper {
  val schemaRegistryClient = new CachedSchemaRegistryClient(schemaRegistry, 128)
  val kafkaAvroDeserializer = new AvroDeserializer(schemaRegistryClient)
  val deserializer = kafkaAvroDeserializer
}

class AvroDeserializer extends AbstractKafkaAvroDeserializer {
  def this(client: SchemaRegistryClient) {
    this()
    this.schemaRegistry = client
  }
  override def deserialize(bytes: Array[Byte]): String = {
    val genericRecord = super.deserialize(bytes).asInstanceOf[GenericRecord]
    genericRecord.toString
  }
}

spark.udf.register("deserialize", (bytes: Array[Byte]) =>
  DeserializerWrapper.deserializer.deserialize(bytes))
Now I would like to write the data to another Kafka topic - is there a simple way?
You'd need to use similarly ugly code that uses a serializer UDF over a Struct column (or primitive type).
There are libraries that can help make it less ugly, e.g. https://github.com/AbsaOSS/ABRiS
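For example, with ABRiS the write side can look roughly like this (a sketch against the ABRiS 5.x/6.x fluent API; the topic name, registry URL, checkpoint path, and df are placeholders):

import org.apache.spark.sql.functions.{col, struct}
import za.co.absa.abris.avro.functions.to_avro
import za.co.absa.abris.config.AbrisConfig

// Serialize with the latest schema registered under the topic-name strategy
val toAvroConfig = AbrisConfig
  .toConfluentAvro
  .downloadSchemaByLatestVersion
  .andTopicNameStrategy("target-topic")
  .usingSchemaRegistry("http://localhost:8081")

df.select(to_avro(struct(df.columns.map(col): _*), toAvroConfig) as "value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "target-topic")
  .option("checkpointLocation", "/tmp/kafka-sink-checkpoint")
  .start()

A key column can be produced the same way with a second to_avro call; if it is left out, the records get a null key.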

Write into a Kafka topic using Spark and Scala

I am reading data from a Kafka topic and writing the received data back into another Kafka topic.
Below is my code:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter

// Loading data from Kafka
val data = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "*******:9092")
  .option("subscribe", "PARAMTABLE")
  .option("startingOffsets", "latest")
  .load()

// Extracting value from JSON
val schema = new StructType()
  .add("PARAM_INSTANCE_ID", IntegerType)
  .add("ENTITY_ID", IntegerType)
  .add("PARAM_NAME", StringType)
  .add("VALUE", StringType)
val df1 = data.selectExpr("CAST(value AS STRING)")
val dataDF = df1.select(from_json(col("value"), schema).as("data")).select("data.*")

// Insert into another Kafka topic
val topic = "SparkParamValues"
val brokers = "********:9092"
val writer = new KafkaSink(topic, brokers)
val query = dataDF.writeStream
  .foreach(writer)
  .outputMode("update")
  .start()
  .awaitTermination()
I am getting the below error:
<console>:47: error: not found: type KafkaSink
val writer = new KafkaSink(topic, brokers)
I am very new to Spark. Can someone suggest how to resolve this or verify whether the above code is correct? Thanks in advance.
In Spark Structured Streaming, you can write to a Kafka topic after reading from another topic by using the existing DataStreamWriter for Kafka, or you can create your own sink by extending the ForeachWriter class.
Without using a custom sink:
You can use the code below to write a dataframe to Kafka, assuming df is the dataframe generated by reading from a Kafka topic.
The dataframe should have at least one column named value. If you have multiple columns, merge them into a single column named value. If a key column is not specified, the key will be null in the destination topic.
df.select("key", "value")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "<topicName>")
.start()
.awaitTermination()
Using a custom sink:
If you want to implement your own Kafka sink, you need to create a class extending ForeachWriter, override a few methods, and pass an object of that class to the foreach() method.
// Using an anonymous class to extend ForeachWriter
df.writeStream.foreach(new ForeachWriter[Row] {
  // If you are writing a Dataset[String], use new ForeachWriter[String] instead
  def open(partitionId: Long, version: Long): Boolean = {
    // open connection
    true
  }
  def process(record: Row): Unit = {
    // write rows to connection
  }
  def close(errorOrNull: Throwable): Unit = {
    // close the connection
  }
}).start()
You can check this Databricks notebook for the implemented code (scroll down to the code under the Kafka Sink heading). I think you are referring to that page. To solve the issue, you need to make sure that the KafkaSink class is available to your Spark code. You can keep both the Spark code file and the class file in the same package. If you are running in spark-shell, paste the KafkaSink class before pasting the Spark code.
Read the Structured Streaming Kafka integration guide to explore more.
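For reference, here is a minimal sketch of what such a KafkaSink class could look like. It is an illustrative stand-in with the same constructor shape as new KafkaSink(topic, brokers) above, not the notebook's exact code; the comma-joined row string is a placeholder payload.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{ForeachWriter, Row}

// Hypothetical sink: one producer per partition/epoch, each row sent as a plain string value.
class KafkaSink(topic: String, brokers: String) extends ForeachWriter[Row] {
  private var producer: KafkaProducer[String, String] = _

  override def open(partitionId: Long, version: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", brokers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true
  }

  override def process(row: Row): Unit =
    producer.send(new ProducerRecord[String, String](topic, row.mkString(",")))

  override def close(errorOrNull: Throwable): Unit =
    if (producer != null) producer.close()
}

In practice you would build the JSON (or Avro) payload you need inside process() instead of row.mkString.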

What FileOutputCommitter should be used when writing Avro files in Spark?

When saving an RDD to S3 in Avro, I get the following warning in the console:
Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.
I haven't been able to find a simple implicit such as saveAsAvroFile, so I dug around and came up with this:
import org.apache.avro.Schema
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD
object AvroUtil {
  def write[T](
      path: String,
      schema: Schema,
      avroRdd: RDD[T],
      job: Job = Job.getInstance()): Unit = {
    val intermediateRdd = avroRdd.mapPartitions(
      f = (iter: Iterator[T]) => iter.map(new AvroKey(_) -> NullWritable.get()),
      preservesPartitioning = true
    )
    job.getConfiguration.set("avro.output.codec", "snappy")
    job.getConfiguration.set("mapreduce.output.fileoutputformat.compress", "true")
    AvroJob.setOutputKeySchema(job, schema)
    intermediateRdd.saveAsNewAPIHadoopFile(
      path,
      classOf[AvroKey[T]],
      classOf[NullWritable],
      classOf[AvroKeyOutputFormat[T]],
      job.getConfiguration
    )
  }
}
I'm rather baffled, as I don't see what is incorrect; the Avro files seem to be written out correctly.
You can override the behaviour of the existing FileOutputCommitter by implementing your own committer to make it more efficient and safe.
Follow this link, where the author explains something similar with an example.
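As a rough sketch of that idea applied to the AvroUtil job above (the classes are illustrative placeholders, not a complete S3-safe committer):

import org.apache.avro.mapreduce.AvroKeyOutputFormat
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.{FileOutputCommitter, FileOutputFormat}

// Hypothetical committer that customizes the commit phase (e.g. to avoid slow S3 renames).
class DirectAvroOutputCommitter(outputPath: Path, context: TaskAttemptContext)
    extends FileOutputCommitter(outputPath, context) {
  override def commitJob(jobContext: JobContext): Unit = {
    // put the faster/safer commit logic here; the sketch just falls back to the default
    super.commitJob(jobContext)
  }
}

// Output format that plugs the custom committer into the Avro write path.
class DirectAvroKeyOutputFormat[T] extends AvroKeyOutputFormat[T] {
  override def getOutputCommitter(context: TaskAttemptContext) =
    new DirectAvroOutputCommitter(FileOutputFormat.getOutputPath(context), context)
}

You would then pass classOf[DirectAvroKeyOutputFormat[T]] to saveAsNewAPIHadoopFile in place of classOf[AvroKeyOutputFormat[T]].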

How to get a Map[String, String] returned by a Kafka Consumer (Alpakka)?

I should get a Map[String, String] back from a Kafka consumer, but I don't really know how. I managed to configure the consumer and it works fine, but I don't understand how I can actually get the Map.
implicit val system: ActorSystem = ActorSystem()

val consumerConfig = system.settings.config.getConfig("akka.kafka.consumer")
val kafkaConsumerSettings =
  ConsumerSettings(consumerConfig, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9094")
    .withGroupId("group1")

Consumer
  .plainSource(kafkaConsumerSettings, Subscriptions.topics(entity.entity_name))
  .toMat(Sink.foreach(println))(DrainingControl.apply)
  .run()
Lightbend's recommendation is to deal with byte arrays while deserializing incoming data from Kafka:
The general recommendation for de-/serialization of messages is to use byte arrays (or Strings) as value and do the de-/serialization in a map operation in the Akka Stream instead of implementing it directly in Kafka de-/serializers. When deserialization is handled explicitly within the Akka Stream, it is easier to implement the desired error handling strategy as the examples below show.
To do so, you may set up the consumer with these settings:
val consumerSettings = ConsumerSettings(consumerConfig, new StringDeserializer, new ByteArrayDeserializer)
Then get the result by calling the .value() method on your record. To deserialize it, I would recommend using circe + jawn. This code should do the trick:
import java.nio.ByteBuffer
import io.circe.jawn
import io.circe.generic.auto._

val bytes = record.value()
val data = jawn.parseByteBuffer(ByteBuffer.wrap(bytes)).flatMap(_.as[Map[String, String]])
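Wired into the stream itself, that looks roughly like this (reusing the consumerSettings with the ByteArrayDeserializer defined above):

import java.nio.ByteBuffer
import io.circe.jawn

Consumer
  .plainSource(consumerSettings, Subscriptions.topics(entity.entity_name))
  // Deserialize inside the stream, as recommended; each element becomes an Either[io.circe.Error, Map[String, String]]
  .map(record => jawn.parseByteBuffer(ByteBuffer.wrap(record.value()))
    .flatMap(_.as[Map[String, String]]))
  .toMat(Sink.foreach(println))(DrainingControl.apply)
  .run()

From there you can pattern match on the Either to apply whatever error-handling strategy you need.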

Simple Spark Scala Post to External Rest API Example

New to Spark and Scala, I just want to read a JSON file and post the content to an external REST API server. Can anyone provide a simple example or some guidelines?
You probably do not want to use Spark for this. Spark is an analytics engine for processing large amounts of data; unless you're reading in massive amounts of JSON from HDFS, this task is better suited to plain Scala. You should look up ways to read a JSON file in Scala and send that content to a server in Scala.
Here are some great places to get started:
Scala Read JSON file
https://alvinalexander.com/scala/how-to-send-json-post-data-to-restful-url-in-scala
The following is from the above URL:
import java.io._
import org.apache.commons._
import org.apache.http._
import org.apache.http.client._
import org.apache.http.client.methods.HttpPost
import org.apache.http.impl.client.DefaultHttpClient
import java.util.ArrayList
import org.apache.http.message.BasicNameValuePair
import org.apache.http.client.entity.UrlEncodedFormEntity
import com.google.gson.Gson
case class Person(firstName: String, lastName: String, age: Int)
object HttpJsonPostTest extends App {

  // create our object as a json string
  val spock = new Person("Leonard", "Nimoy", 82)
  val spockAsJson = new Gson().toJson(spock)

  // add name value pairs to a post object
  val post = new HttpPost("http://localhost:8080/posttest")
  val nameValuePairs = new ArrayList[NameValuePair]()
  nameValuePairs.add(new BasicNameValuePair("JSON", spockAsJson))
  post.setEntity(new UrlEncodedFormEntity(nameValuePairs))

  // send the post request
  val client = new DefaultHttpClient
  val response = client.execute(post)

  println("--- HEADERS ---")
  response.getAllHeaders.foreach(arg => println(arg))
}