Spark Structured Streaming: console sink is not working as expected - scala

I have the following code to read and process Kafka data using Structured Streaming:
import org.apache.spark.sql.{ForeachWriter, SparkSession}

object ETLTest {

  case class record(value: String, topic: String)

  def main(args: Array[String]): Unit = {
    run()
  }

  def run(): Unit = {
    val spark = SparkSession
      .builder
      .appName("Test JOB")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val kafkaStreamingDF = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "...")
      .option("subscribe", "...")
      .option("failOnDataLoss", "false")
      .option("startingOffsets", "earliest")
      .load()
      .selectExpr("CAST(value as STRING)", "CAST(timestamp as STRING)", "CAST(topic as STRING)")

    val sdvWriter = new ForeachWriter[record] {
      def open(partitionId: Long, version: Long): Boolean = true
      def process(record: record): Unit = {
        println("record:: " + record)
      }
      def close(errorOrNull: Throwable): Unit = {}
    }

    val sdvDF = kafkaStreamingDF
      .as[record]
      .filter($"value".isNotNull)

    // DOES NOT WORK
    /*val query = sdvDF
      .writeStream
      .format("console")
      .start()
      .awaitTermination()*/

    // WORKS
    /*val query = sdvDF
      .writeStream
      .foreach(sdvWriter)
      .start()
      .awaitTermination()
    */
  }
}
I am running this code from the IntelliJ IDEA IDE. When I use foreach(sdvWriter) I can see the records consumed from Kafka, but when I use .writeStream.format("console") I do not see any records. I assume that the console write stream is maintaining some sort of checkpoint and assumes it has processed all the records. Is that the case? Am I missing something obvious here?
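As far as I can tell, the console sink does not persist offsets between runs unless the query is given an explicit checkpointLocation, so stale state is unlikely; one way to rule it out is to point the console query at a fresh, explicit checkpoint directory. A minimal sketch (the path is hypothetical):

// Sketch to rule out stale checkpoint state: give the console query its own,
// explicitly fresh checkpoint directory and delete it between runs.
val query = sdvDF
  .writeStream
  .format("console")
  .option("truncate", "false")
  .option("checkpointLocation", "/tmp/etl-test-console-checkpoint")
  .start()

query.awaitTermination()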

I reproduced your code here and both options worked. Actually, without the
import spark.implicits._ both of them fail, so I'm not sure what you are missing; it might be that some dependencies are not configured correctly. Can you add your pom.xml?
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

object Check {

  case class record(value: String, topic: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder().master("local[2]")
      .getOrCreate

    import spark.implicits._

    val kafkaStreamingDF = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "test")
      .option("startingOffsets", "earliest")
      .option("failOnDataLoss", "false")
      .load()
      .selectExpr("CAST(value as STRING)", "CAST(timestamp as STRING)", "CAST(topic as STRING)")

    val sdvDF = kafkaStreamingDF
      .as[record]
      .filter($"value".isNotNull)

    val query = sdvDF.writeStream
      .format("console")
      .option("truncate", "false")
      .start()
      .awaitTermination()
  }
}
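Since the answer asks about the build setup: the Kafka source is not bundled with spark-sql, so the spark-sql-kafka-0-10 artifact has to be on the classpath. A build.sbt sketch (versions are illustrative; align them with the Spark version you run against):

// build.sbt sketch -- adjust sparkVersion to match your cluster/IDE setup
val sparkVersion = "3.3.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"            % sparkVersion,
  "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion
)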

Related

Histogram/Counter metrics for Prometheus in Spark Job readStream/writeStream (Kafka to parquet). How to send metric from LOOP or Event or Listener

How can I send Histogram/Counter metrics to Prometheus from a Spark job in:
a loop
foreachBatch
the methods of ForeachWriter
Spark events
using org.apache.spark.metrics.source.Source in a streaming Spark job?
I am able to accumulate metrics in collection accumulator(s), but I cannot find a context from which I can send the accumulated metrics without running into compilation or execution problems.
The most common error:
22/11/28 14:24:36 ERROR MicroBatchExecution: Query [id = 5d2fc03c-1dbc-4bb1-a821-397586d22cf4, runId = e665dcd2-6e3d-4b03-8684-11844de040f0] terminated with error
org.apache.spark.SparkException: Task not serializable
or the Spark job stops on the worker about 15 seconds after starting, with various error messages.
The closest thing to a solution I have found is https://gist.github.com/ambud/641f8fc25f7f8d3923d6fd10f64b7184, but it only works locally with a simple spark-submit; it does not work on the cluster, where the collection returned by SparkEnv.get.metricsSystem.getSourcesByName is always empty.
I see only dubious ways to fix this issue, and I don't believe there is no common solution.
package org.apache.spark.metrics.source

import com.codahale.metrics.{Counter, Histogram, MetricRegistry}

class PrometheusMetricSource extends Source {
  override val sourceName: String = "PrometheusMetricSource"
  override val metricRegistry: MetricRegistry = new MetricRegistry
  val myMetric: Histogram = metricRegistry.histogram(MetricRegistry.name("myMetric"))
}

import org.apache.spark.SparkEnv
import org.apache.spark.metrics.source.PrometheusMetricSource
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.{DataFrame, Dataset, ForeachWriter, SparkSession}

object Example {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("My Spark job").getOrCreate()
    import spark.implicits._

    val source: PrometheusMetricSource = new PrometheusMetricSource
    SparkEnv.get.metricsSystem.registerSource(source)

    val df: DataFrame = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my-topic")
      .option("startingOffsets", "earliest")
      .load()

    val ds: Dataset[String] =
      df.select(col("value"))
        .as[String]
        .map { str =>
          source.myMetric.update(1L) // submit metric
          str + "test"
        }

    // The "submit metric" sites below illustrate the places where the update was
    // attempted; in a real query only one sink definition actually takes effect.
    ds.writeStream
      .foreachBatch { (batchDF: Dataset[String], batchId: Long) =>
        source.myMetric.update(1L) // submit metric
      }
      .foreach(new ForeachWriter[String] {
        def open(partitionId: Long, version: Long): Boolean = true
        def close(errorOrNull: Throwable): Unit = {}
        def process(record: String): Unit = {
          source.myMetric.update(1L) // submit metric
        }
      })
      .outputMode("append")
      .format("parquet")
      .option("path", "/share/parquet")
      .option("checkpointLocation", "/share/checkpoints")
      .start()
      .awaitTermination()
  }
}
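The "Task not serializable" failures typically come from capturing the driver-side PrometheusMetricSource (its MetricRegistry is not serializable) in the map/foreach closures. One commonly suggested workaround, sketched below under the assumption that the helper stays in an org.apache.spark.* package so the package-private metricsSystem remains reachable, is to keep the source in a JVM-local singleton and register it lazily on each executor the first time it is touched there (ExecutorMetrics is a hypothetical helper, not part of Spark):

package org.apache.spark.metrics.source

import org.apache.spark.SparkEnv

// Nothing here is serialized into the closure: each JVM (driver or executor)
// creates and registers its own source instance the first time it is accessed.
object ExecutorMetrics {
  lazy val source: PrometheusMetricSource = {
    val s = new PrometheusMetricSource
    SparkEnv.get.metricsSystem.registerSource(s)
    s
  }
}

Inside process() or the map function you would then call ExecutorMetrics.source.myMetric.update(1L) instead of referencing the driver-side source val. Whether Prometheus actually scrapes the values still depends on the metrics sink configuration on the executors, so treat this as a sketch rather than a full solution.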

Kafka Unit testing

I have a function kafkaIngestion which creates a DataFrame from a Kafka topic in the following way:
def kafkaIngestion(spark: SparkSession): DataFrame = {
  val df = spark.read.format("kafka")
    .option("kafka.bootstrap.servers", broker)
    .option("subscribe", topic)
    .option("group.id", grpid)
    .load()
    .selectExpr("cast(value as string) as data")
    .select(from_json($"data", schema = inputSchema).as("data"))
    .select("data.*")
  df
}
I am unable to mock the code to return my expected DataFrame. What is the correct way to mock it?
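One common approach, rather than mocking spark.read itself, is to split the function into a thin Kafka "read" part and a pure "transform" part, and then unit test the transform against a hand-built DataFrame that mimics the Kafka value column. A sketch under that assumption (the schema and sample JSON are placeholders):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.StructType

// Pure transformation: easy to test because it accepts any DataFrame with a
// string "data" column, not specifically a Kafka source.
def parseKafkaValues(raw: DataFrame, inputSchema: StructType): DataFrame = {
  import raw.sparkSession.implicits._
  raw.select(from_json($"data", inputSchema).as("data"))
    .select("data.*")
}

// In a test (e.g. ScalaTest with a local SparkSession), feed JSON strings
// directly instead of reading from Kafka.
def exampleTest(spark: SparkSession, inputSchema: StructType): Unit = {
  import spark.implicits._
  val raw = Seq("""{"id":1,"name":"abc"}""").toDF("data") // shape assumed to match inputSchema
  val result = parseKafkaValues(raw, inputSchema)
  assert(result.count() == 1)
}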

Writing sorted dataframe into kafka topic (sorted order rows) in spark structured streaming using scala

While displaying the sorted results on the console, the rows appear in the expected sort order, but when I push those results to a Kafka topic the sort order is lost.
def main(args: Array[String]) = {
  // Spark config and Kafka config
  // load method
  val Raw_df = readStream(sparkSession, inputtopic)

  // converting the Kafka messages read into JSON format
  val df_messages = Raw_df.selectExpr("CAST(value AS STRING)")
    .withColumn("data", from_json($"value", my_schema))
    .select("data.*")

  val win = window($"date_column", "5 minutes")
  val modified_df = df_messages.withWatermark("date_column", "3 minutes")
    .groupBy(win, $"All_colums", $"date_column")
    .count()
    .orderBy(asc("date_column"), asc("column_5"))

  val finalcol = modified_df.drop("count").drop("window")

  // mapping all columns and converting them to JSON messages
  val finalcolonames = my_schema.fields.map(z => z.name)
  val dataset_Json = finalcol.withColumn("value", to_json(struct(finalcolonames.map(y => col(y)): _*)))
    .select($"value")

  //val query = writeToKafkaStremoutput(dataset_Json, outputtopic, checkpointlocation)
  val query = writeToConsole(dataset_Json)
  (query)
}
// below method writes data to a Kafka topic
def writeToKafkaStremoutput(dataFrame: DataFrame, config: Config, topic: String, checkpointlocation: String) = {
  dataFrame
    .selectExpr("CAST(value AS STRING)")
    .writeStream
    .format("kafka")
    .trigger(Trigger.ProcessingTime("1 second"))
    .option("topic", topic)
    .option("kafka.bootstrap.servers", "kafka.bootstrap_servers")
    .option("checkpointLocation", checkpointlocation)
    .outputMode(OutputMode.Complete())
    .start()
}
// console output for testing
// below method writes data to the console
def writeToConsole(dataFrame: DataFrame) = {
  import org.apache.spark.sql.streaming.Trigger
  val query = dataFrame
    .writeStream
    .format("console")
    .option("numRows", 300)
    //.trigger(Trigger.ProcessingTime("20 second"))
    .outputMode(OutputMode.Complete())
    .option("truncate", false)
    .start()
  query
}
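For context, Kafka only guarantees ordering within a single topic partition, so a globally sorted micro-batch spread over several partitions will look unordered to a consumer. A sketch of one way to keep the end-to-end order, assuming the output topic has exactly one partition (the topic name is hypothetical):

// Sketch: collapse each micro-batch to one task and write to a one-partition
// topic, so Kafka's per-partition ordering preserves the sort. This removes
// write parallelism, so it only suits small, ordered outputs.
val orderedQuery = dataset_Json
  .coalesce(1)                                        // one task writes the whole sorted batch
  .selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka.bootstrap_servers")
  .option("topic", "sorted-output-single-partition")  // hypothetical single-partition topic
  .option("checkpointLocation", checkpointlocation)
  .outputMode(OutputMode.Complete())
  .start()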

How to use foreachPartition in Spark 2.2 to avoid Task Serialization error

I have the following working code that uses Structured Streaming (Spark 2.2) to read data from Kafka (0.10).
The only issue I cannot solve is the task serialization problem when using kafkaProducer inside the ForeachWriter.
In my old version of this code, developed for Spark 1.6, I used foreachPartition and defined a kafkaProducer for each partition to avoid the task serialization problem.
How can I do that in Spark 2.2?
val df: Dataset[String] = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test")
  .option("startingOffsets", "latest")
  .option("failOnDataLoss", "true")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").as[(String, String)]
  .map(_._2)

var mySet = spark.sparkContext.broadcast(Map(
  "metadataBrokerList" -> metadataBrokerList,
  "outputKafkaTopic" -> outputKafkaTopic,
  "batchSize" -> batchSize,
  "lingerMS" -> lingerMS))

val kafkaProducer = Utils.createProducer(mySet.value("metadataBrokerList"),
  mySet.value("batchSize"),
  mySet.value("lingerMS"))

val writer = new ForeachWriter[String] {
  override def process(row: String): Unit = {
    // val result = ...
    val record = new ProducerRecord[String, String](mySet.value("outputKafkaTopic"), "1", result)
    kafkaProducer.send(record)
  }

  override def close(errorOrNull: Throwable): Unit = {}

  override def open(partitionId: Long, version: Long): Boolean = {
    true
  }
}

val query = df
  .writeStream
  .foreach(writer)
  .start

query.awaitTermination()
spark.stop()
Write an implementation of ForeachWriter and then use it. (Avoid anonymous classes with non-serializable objects - in your case it's the ProducerRecord.)
Example: val writer = new YourForeachWriter[String]
Here is also a helpful article about Spark serialization problems: https://www.cakesolutions.net/teamblogs/demystifying-spark-serialisation-error
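A sketch of what such a YourForeachWriter could look like, creating the KafkaProducer per partition inside open() so nothing non-serializable is captured from the driver (the class and parameter names are illustrative):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter

// Illustrative sketch: only serializable fields (strings) travel to the executors;
// the producer itself is created inside open(), once per partition and epoch.
class YourForeachWriter(brokers: String, topic: String) extends ForeachWriter[String] {
  @transient private var producer: KafkaProducer[String, String] = _

  override def open(partitionId: Long, version: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", brokers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true
  }

  override def process(row: String): Unit = {
    producer.send(new ProducerRecord[String, String](topic, "1", row))
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (producer != null) producer.close()
  }
}

// Usage, matching the answer's example:
// val writer = new YourForeachWriter(metadataBrokerList, outputKafkaTopic)
// df.writeStream.foreach(writer).start()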

How to display the records from Kafka to console?

I'm learning Structured Streaming and I am not able to display the output on my console.
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming.ProcessingTime

object kafka_stream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("kafka-consumer")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    spark.sparkContext.setLogLevel("WARN")

    // val schema = StructType().add("a", IntegerType()).add("b", StringType())
    val schema = StructType(Seq(
      StructField("a", IntegerType, true),
      StructField("b", StringType, true)
    ))

    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "172.21.0.187:9093")
      .option("subscribe", "test")
      .option("startingOffsets", "earliest")
      .load()

    val values = df.selectExpr("CAST(value AS STRING)").as[String]

    values.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}
My input to Kafka:
my name is abc how are you ?
I just want to display the strings from Kafka on the Spark console.
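When nothing shows up, one way to check whether the query is actually pulling records (rather than hitting a sink or formatting issue) is to start it without blocking and poll its progress; numInputRows in the progress output tells you whether Kafka rows are arriving. A small debugging sketch, assuming the same values Dataset as above:

// Poll the streaming query instead of blocking immediately, so you can see
// whether any rows are being read from Kafka at all.
val query = values.writeStream
  .outputMode("append")
  .format("console")
  .start()

for (_ <- 1 to 10) {
  Thread.sleep(5000)
  println(query.status)        // whether a trigger is active / waiting for data
  println(query.lastProgress)  // null until the first batch; then JSON with numInputRows
}

query.awaitTermination()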