How to use foreachPartition in Spark 2.2 to avoid Task Serialization error

I have the following working code that uses Structured Streaming (Spark 2.2) to read data from Kafka (0.10).
The only issue I cannot solve is a Task serialization error when using kafkaProducer inside a ForeachWriter.
In my old version of this code, written for Spark 1.6, I used foreachPartition and created a kafkaProducer for each partition to avoid the Task serialization problem.
How can I do this in Spark 2.2?
val df: Dataset[String] = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test")
.option("startingOffsets", "latest")
.option("failOnDataLoss", "true")
.load()
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").as[(String, String)]
.map(_._2)
var mySet = spark.sparkContext.broadcast(Map(
"metadataBrokerList"->metadataBrokerList,
"outputKafkaTopic"->outputKafkaTopic,
"batchSize"->batchSize,
"lingerMS"->lingerMS))
val kafkaProducer = Utils.createProducer(mySet.value("metadataBrokerList"),
mySet.value("batchSize"),
mySet.value("lingerMS"))
val writer = new ForeachWriter[String] {
override def process(row: String): Unit = {
// val result = ...
val record = new ProducerRecord[String, String](mySet.value("outputKafkaTopic"), "1", result);
kafkaProducer.send(record)
}
override def close(errorOrNull: Throwable): Unit = {}
override def open(partitionId: Long, version: Long): Boolean = {
true
}
}
val query = df
.writeStream
.foreach(writer)
.start
query.awaitTermination()
spark.stop()

Write your own implementation of ForeachWriter as a named class and then use it. (Avoid anonymous classes that capture non-serializable objects. In your case that is the kafkaProducer created on the driver; the ProducerRecord is built inside process, so it is not captured.)
Example: val writer = new YourForeachWriter[String]
Also, here is a helpful article about Spark serialization problems: https://www.cakesolutions.net/teamblogs/demystifying-spark-serialisation-error
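To illustrate the advice above without needing a cluster, here is a minimal sketch that uses plain Java serialization. FakeProducer, Writer, EagerWriter and LazyWriter are hypothetical names invented for this example: FakeProducer stands in for a non-serializable client such as KafkaProducer, and Writer only mirrors the open/process/close shape of ForeachWriter. It is an assumption that Java serialization is a fair proxy for what Spark does when it ships the writer to executors.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a non-serializable client such as KafkaProducer.
class FakeProducer {
  def send(msg: String): Unit = println(s"sent: $msg")
  def close(): Unit = ()
}

// Hypothetical trait that mirrors the ForeachWriter lifecycle without Spark.
trait Writer[T] extends Serializable {
  def open(partitionId: Long, version: Long): Boolean
  def process(value: T): Unit
  def close(errorOrNull: Throwable): Unit
}

// Anti-pattern: the producer is built up front and stored in a field,
// so it has to be serialized together with the writer.
class EagerWriter(producer: FakeProducer) extends Writer[String] {
  def open(partitionId: Long, version: Long): Boolean = true
  def process(value: String): Unit = producer.send(value)
  def close(errorOrNull: Throwable): Unit = ()
}

// Suggested pattern: only serializable settings are fields; the producer
// is created in open(), that is, after the writer has been shipped.
class LazyWriter(topic: String) extends Writer[String] {
  @transient private var producer: FakeProducer = _
  def open(partitionId: Long, version: Long): Boolean = {
    producer = new FakeProducer
    true
  }
  def process(value: String): Unit = producer.send(s"$topic <- $value")
  def close(errorOrNull: Throwable): Unit = if (producer != null) producer.close()
}

object SerializationDemo {
  // True if Java serialization accepts the object.
  def serializes(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    println(serializes(new EagerWriter(new FakeProducer))) // false
    println(serializes(new LazyWriter("output")))          // true
  }
}
```

With a real ForeachWriter the same idea applies: pass broker list, topic and batch settings as constructor arguments and create the KafkaProducer inside open.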

Related

Histogram/Counter metrics for Prometheus in Spark Job readStream/writeStream (Kafka to parquet). How to send metric from LOOP or Event or Listener

How to send Histogram/Counter metrics for Prometheus from Spark job in:
Loop
foreachBatch
methods of ForeachWriter
Spark events
using org.apache.spark.metrics.source.Source in Spark job with stream?
I can accumulate metrics in collection accumulators, but I cannot find a context from which to send the accumulated metrics without compilation or runtime errors.
Common issues:
22/11/28 14:24:36 ERROR MicroBatchExecution: Query [id = 5d2fc03c-1dbc-4bb1-a821-397586d22cf4, runId = e665dcd2-6e3d-4b03-8684-11844de040f0] terminated with error
org.apache.spark.SparkException: Task not serializable
or
The Spark job stops on the Spark worker about 15 seconds after start, with varying error messages.
A solution I found:
It works in a local environment with a simple spark-submit, but it does not work on the cluster. The collection returned by SparkEnv.get.metricsSystem.getSourcesByName is always empty.
https://gist.github.com/ambud/641f8fc25f7f8d3923d6fd10f64b7184
I have only found dubious ways to fix this issue. I find it hard to believe that there is no common solution.
package org.apache.spark.metrics.source
import com.codahale.metrics.{Counter, Histogram, MetricRegistry}
class PrometheusMetricSource extends Source {
override val sourceName: String = "PrometheusMetricSource"
override val metricRegistry: MetricRegistry = new MetricRegistry
val myMetric: Histogram = metricRegistry.histogram(MetricRegistry.name("myMetric"))
}
import org.apache.spark.SparkEnv
import org.apache.spark.metrics.source.PrometheusMetricSource
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.{DataFrame, Dataset, ForeachWriter, SparkSession}
object Example {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("My Spark job").getOrCreate()
import spark.implicits._
val source: PrometheusMetricSource = new PrometheusMetricSource
SparkEnv.get.metricsSystem.registerSource(source)
val df: DataFrame = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "my-topic")
.option("startingOffsets", "earliest")
.load()
val ds: Dataset[String] =
df.select(col("value"))
.as[String]
.map { str =>
source.myMetric.update(1L) // submit metric ////////////////////////
str + "test"
}
ds.writeStream
.foreachBatch {
(batchDF: Dataset[String],
batchId: Long) =>
source.myMetric.update(1L) // submit metric ////////////////////////
}
.foreach(new ForeachWriter[String] {
def open(partitionId: Long, version: Long): Boolean = true
def close(errorOrNull: Throwable): Unit = {}
def process(record: String) = {
source.myMetric.update(1L) // submit metric ////////////////////////
}
})
.outputMode("append")
.format("parquet")
.option("path", "/share/parquet")
.option("checkpointLocation", "/share/checkpoints")
.start()
.awaitTermination()
}
}
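The "Task not serializable" part of the problem can be reproduced outside Spark. The sketch below is only an illustration under stated assumptions: LocalSource and ExecutorMetrics are hypothetical names, LocalSource stands in for a non-serializable source like PrometheusMetricSource, and plain Java serialization is used as a proxy for Spark's closure shipping. It shows that a function capturing a local source object cannot be serialized, while a function that reaches its counter through a Scala object (initialised separately in each JVM) captures nothing. It does not address registering the source with the metrics system on executors.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}
import java.util.concurrent.atomic.LongAdder

// Hypothetical stand-in for a metrics source; deliberately not Serializable.
class LocalSource { val counter = new LongAdder }

// Hypothetical per-JVM holder: a Scala object is initialised on first use in
// each JVM and is referenced statically, so closures do not capture it.
object ExecutorMetrics { lazy val counter = new LongAdder }

object ClosureCaptureDemo {
  // True if Java serialization accepts the value.
  def serializes(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  // Mirrors the question: the function closes over a local, non-serializable source.
  def capturing(): String => String = {
    val source = new LocalSource
    str => { source.counter.increment(); str + "test" }
  }

  // Alternative: the function only refers to the holder object.
  def viaHolder(): String => String =
    str => { ExecutorMetrics.counter.increment(); str + "test" }

  def main(args: Array[String]): Unit = {
    println(serializes(capturing()))
    println(serializes(viaHolder()))
  }
}
```

Note that a counter held this way is updated separately in every executor JVM, so it still has to be exposed from each executor to be scraped.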

Kafka Unit testing

I have a function kafkaIngestion which creates a DataFrame from a Kafka topic in the following way:
def kafkaIngestion(spark: SparkSession): DataFrame = {
val df = spark.read.format("kafka")
.option("kafka.bootstrap.servers", broker)
.option("subscribe", topic)
.option("group.id", grpid)
.load()
.selectExpr("cast(value as string) as data")
.select(from_json($"data", schema=inputSchema)
.as("data"))
.select("data.*")
df
}
I am unable to mock the code so that it returns my expected df. What's the correct way to mock the df?
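One common way to avoid mocking altogether is to separate the Kafka read from the parsing, so that a test can feed the parsing step a hand-built DataFrame. The following is a sketch under assumptions, not a definitive answer: KafkaParsing, parseValue, KafkaParsingCheck and the two-field schema are hypothetical names and data chosen for the example, standing in for inputSchema and real messages.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object KafkaParsing {
  // Pure transformation: no Kafka access, so a test can pass any DataFrame
  // that has a `value` column.
  def parseValue(raw: DataFrame, schema: StructType): DataFrame =
    raw
      .selectExpr("cast(value as string) as data")
      .select(from_json(col("data"), schema).as("data"))
      .select("data.*")
}

object KafkaParsingCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("parse-test").getOrCreate()
    import spark.implicits._

    // Hypothetical schema and payload used only for this check.
    val schema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("name", StringType)))
    val raw = Seq("""{"id": 1, "name": "a"}""").toDF("value")

    val rows = KafkaParsing.parseValue(raw, schema).collect()
    assert(rows.length == 1)
    assert(rows(0).getAs[Int]("id") == 1)
    assert(rows(0).getAs[String]("name") == "a")
    spark.stop()
  }
}
```

kafkaIngestion would then only do the spark.read.format("kafka")...load() part and call parseValue on the result, which leaves very little untested code behind the Kafka dependency.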

Encoder issue: Spark Structured Streaming works in REPL only

I have a working process to ingest and deserialize Kafka Avro messages using the Schema Registry. It works great in the REPL, but when I try to compile I get
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
[error] .map(x => {
I'm not sure if I need to modify my object, but why would I need to if the REPL works fine?
object AgentDeserializerWrapper {
val props = new Properties()
props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryURL)
props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, "true")
val vProps = new kafka.utils.VerifiableProperties(props)
val deser = new KafkaAvroDecoder(vProps)
val avro_schema = new RestService(schemaRegistryURL).getLatestVersion(subjectValueNameAgentRead)
val messageSchema = new Schema.Parser().parse(avro_schema.getSchema)
}
case class DeserializedFromKafkaRecord( value: String)
import spark.implicits._
val agentStringDF = spark
.readStream
.format("kafka")
.option("subscribe", "agent")
.options(kafkaParams)
.load()
.map(x => {
DeserializedFromKafkaRecord(AgentDeserializerWrapper.deser.fromBytes(x.getAs[Array[Byte]]("value"), AgentDeserializerWrapper.messageSchema).asInstanceOf[GenericData.Record].toString)
})

Add as[DeserializedFromKafkaRecord] to give your Dataset a static type:
val agentStringDF = spark
.readStream
.format("kafka")
.option("subscribe", "agent")
.options(kafkaParams)
.load()
.as[DeserializedFromKafkaRecord]
.map(x => {
DeserializedFromKafkaRecord(AgentDeserializerWrapper.deser.fromBytes(x.getAs[Array[Byte]]("value"), AgentDeserializerWrapper.messageSchema).asInstanceOf[GenericData.Record].toString)
})

Spark Structured Streaming: console sink is not working as expected

I have the following code to read and process Kafka data using Structured Streaming
object ETLTest {
case class record(value: String, topic: String)
def main(args: Array[String]): Unit = {
run();
}
def run(): Unit = {
val spark = SparkSession
.builder
.appName("Test JOB")
.master("local[*]")
.getOrCreate()
val kafkaStreamingDF = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "...")
.option("subscribe", "...")
.option("failOnDataLoss", "false")
.option("startingOffsets","earliest")
.load()
.selectExpr("CAST(value as STRING)", "CAST(timestamp as STRING)","CAST(topic as STRING)")
val sdvWriter = new ForeachWriter[record] {
def open(partitionId: Long, version: Long): Boolean = {
true
}
def process(record: record) = {
println("record:: " + record)
}
def close(errorOrNull: Throwable): Unit = {}
}
val sdvDF = kafkaStreamingDF
.as[record]
.filter($"value".isNotNull)
// DOES NOT WORK
/*val query = sdvDF
.writeStream
.format("console")
.start()
.awaitTermination()*/
// WORKS
/*val query = sdvDF
.writeStream
.foreach(sdvWriter)
.start()
.awaitTermination()
*/
}
}
I am running this code from IntelliJ IDEA. When I use foreach(sdvWriter), I can see the records consumed from Kafka, but when I use .writeStream.format("console") I do not see any records. I assume that the console write stream maintains some sort of checkpoint and assumes it has already processed all the records. Is that the case? Am I missing something obvious here?

I reproduced your code here and both options worked. In fact, without
import spark.implicits._ both options would fail, so I'm not sure what you are missing. Some dependencies might not be configured correctly. Can you add the pom.xml?
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger
object Check {
case class record(value: String, topic: String)
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder().master("local[2]")
.getOrCreate
import spark.implicits._
val kafkaStreamingDF = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test")
.option("startingOffsets","earliest")
.option("failOnDataLoss", "false")
.load()
.selectExpr("CAST(value as STRING)", "CAST(timestamp as STRING)","CAST(topic as STRING)")
val sdvDF = kafkaStreamingDF
.as[record]
.filter($"value".isNotNull)
val query = sdvDF.writeStream
.format("console")
.option("truncate","false")
.start()
.awaitTermination()
}
}

Read json from Kafka and write json to other Kafka topic

I'm trying to prepare an application for Spark Streaming (Spark 2.1, Kafka 0.10).
I need to read data from the Kafka topic "input", find the correct data and write the result to the topic "output".
I can read data from Kafka based on the KafkaUtils.createDirectStream method.
I convert the RDD to JSON and prepare filters:
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
val elementDstream = messages.map(v => v.value).foreachRDD { rdd =>
val PeopleDf=spark.read.schema(schema1).json(rdd)
import spark.implicits._
PeopleDf.show()
val PeopleDfFilter = PeopleDf.filter(($"value1".rlike("1"))||($"value2" === 2))
PeopleDfFilter.show()
}
I can load data from Kafka and write it "as is" to Kafka using KafkaProducer:
messages.foreachRDD( rdd => {
rdd.foreachPartition( partition => {
val kafkaTopic = "output"
val props = new HashMap[String, Object]()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)
partition.foreach{ record: ConsumerRecord[String, String] => {
System.out.print("########################" + record.value())
val messageResult = new ProducerRecord[String, String](kafkaTopic, record.value())
producer.send(messageResult)
}}
producer.close()
})
})
However, I cannot combine those two actions, that is, find the proper values in the JSON and write the findings to Kafka: I want to write PeopleDfFilter in JSON format to the "output" Kafka topic.
I have a lot of input messages in Kafka, which is why I want to use foreachPartition to create the Kafka producer.

The process is very simple, so why not use Structured Streaming all the way?
import org.apache.spark.sql.functions.{from_json, to_json}
import spark.implicits._ // for the $"..." column syntax
spark
// Read the data
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", inservers)
.option("subscribe", intopic)
.load()
// Transform / filter
.select(from_json($"value".cast("string"), schema).alias("value"))
.filter(...) // Add the condition
.select(to_json($"value").alias("value"))
// Write back
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", outservers)
.option("topic", outtopic) // the Kafka sink takes "topic", not "subscribe"
.option("checkpointLocation", checkpointDir) // required by the Kafka sink; checkpointDir is a placeholder name
.start()

Try using Structured Streaming for that. Even if you use Spark 2.1, you can implement your own Kafka ForeachWriter as follows:
Kafka sink:
import java.util.Properties
// kafkashaded.* is the Databricks-shaded package; elsewhere use org.apache.kafka.clients.producer._
import kafkashaded.org.apache.kafka.clients.producer._
import org.apache.spark.sql.ForeachWriter
class KafkaSink(topic:String, servers:String) extends ForeachWriter[(String, String)] {
val kafkaProperties = new Properties()
kafkaProperties.put("bootstrap.servers", servers)
// getName yields the bare class name; toString would prepend "class "
kafkaProperties.put("key.serializer",
classOf[org.apache.kafka.common.serialization.StringSerializer].getName)
kafkaProperties.put("value.serializer",
classOf[org.apache.kafka.common.serialization.StringSerializer].getName)
val results = new scala.collection.mutable.HashMap[String, String]
var producer: KafkaProducer[String, String] = _
def open(partitionId: Long,version: Long): Boolean = {
producer = new KafkaProducer(kafkaProperties)
true
}
def process(value: (String, String)): Unit = {
producer.send(new ProducerRecord(topic, value._1 + ":" + value._2))
}
def close(errorOrNull: Throwable): Unit = {
producer.close()
}
}
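Usage:
import org.apache.spark.sql.streaming.ProcessingTime // in Spark 2.2+ prefer Trigger.ProcessingTime
val topic = "<topic2>"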
val brokers = "<server:ip>"
val writer = new KafkaSink(topic, brokers)
val query =
streamingSelectDF
.writeStream
.foreach(writer)
.outputMode("update")
.trigger(ProcessingTime("25 seconds"))
.start()
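If you would rather stay on the DStream API from the question, the filter and the per-partition producer can be combined roughly as sketched below. This is a minimal sketch under assumptions: JsonToKafka, producerProps and write are hypothetical names introduced for the example, and it assumes the kafka-clients and Spark SQL libraries are on the classpath. The filtered DataFrame is turned into JSON strings with toJSON, and one producer is created per partition on the executor, as in the question's foreachPartition code.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer
import org.apache.spark.sql.DataFrame

object JsonToKafka {
  // Producer settings only; java.util.Properties is Serializable,
  // so it can safely be built on the driver and used inside the closure.
  def producerProps(servers: String): Properties = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, servers)
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props
  }

  // Writes every row of df as one JSON message to the given topic.
  def write(df: DataFrame, servers: String, topic: String): Unit = {
    val props = producerProps(servers)
    df.toJSON.rdd.foreachPartition { (rows: Iterator[String]) =>
      // Created on the executor, once per partition, so it is never serialized.
      val producer = new KafkaProducer[String, String](props)
      try rows.foreach(json => producer.send(new ProducerRecord[String, String](topic, json)))
      finally producer.close()
    }
  }
}
```

Inside the question's foreachRDD block this would be called as JsonToKafka.write(PeopleDfFilter, "localhost:9092", "output") after the filter has been applied.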
