Each query takes more time using Structured Streaming with Spark - scala

I'm using Spark 2.3.0, Scala 2.11.8 and Kafka, and I'm trying to write all the messages from Kafka into parquet files with Structured Streaming. However, with each query my implementation runs, the total time per query increases a lot (see the attached Spark Stages image).
I would like to know why this happens. I tried different triggers (Continuous, 0 seconds, 1 second, 10 seconds, 10 minutes, etc.) and I always get the same behavior. My code has this structure:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Column, SparkSession}
import com.name.proto.ProtoMessages
import java.io._
import java.text.{DateFormat, SimpleDateFormat}
import java.util.Date
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.streaming.OutputMode
object StructuredStreaming {

  def message_proto(value: Array[Byte]): Map[String, String] = {
    try {
      val dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
      val impression_proto = ProtoMessages.TrackingRequest.parseFrom(value)
      val json = Map(
        "id_req" -> (impression_proto.getIdReq().toString),
        "ts_imp_request" -> (impression_proto.getTsRequest().toString),
        "is_after" -> (impression_proto.getIsAfter().toString),
        "type" -> (impression_proto.getType().toString)
      )
      return json
    } catch {
      case e: Exception =>
        val pw = new PrintWriter(new File("/home/data/log.log"))
        pw.write(e.toString)
        pw.close()
        return Map("error" -> "error")
    }
  }

  def main(args: Array[String]) {
    val proto_impressions_udf = udf(message_proto _)

    val spark = SparkSession.builder.appName("Structured Streaming ").getOrCreate()

    //fetchOffset.numRetries, fetchOffset.retryIntervalMs
    val stream = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "ip:9092")
      .option("subscribe", "ssp.impressions")
      .option("startingOffsets", "latest")
      .option("max.poll.records", "1000000")
      .option("auto.commit.interval.ms", "100000")
      .option("session.timeout.ms", "10000")
      .option("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      .option("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
      .option("failOnDataLoss", "false")
      .option("latestFirst", "true")
      .load()

    try {
      val query = stream.select(col("value").cast("string"))
        .select(proto_impressions_udf(col("value")) as "value_udf")
        .select(col("value_udf")("id_req").as("id_req"), col("value_udf")("is_after").as("is_after"),
          date_format(col("value_udf")("ts_request"), "yyyy").as("date").as("year"),
          date_format(col("value_udf")("ts_request"), "MM").as("date").as("month"),
          date_format(col("value_udf")("ts_request"), "dd").as("date").as("day"),
          date_format(col("value_udf")("ts_request"), "HH").as("date").as("hour"))

      val query2 = query.writeStream.format("parquet")
        .option("checkpointLocation", "/home/data/impressions/checkpoint")
        .option("path", "/home/data/impressions")
        .outputMode(OutputMode.Append())
        .partitionBy("year", "month", "day", "hour")
        .trigger(Trigger.ProcessingTime("1 seconds"))
        .start()
    } catch {
      case e: Exception =>
        val pw = new PrintWriter(new File("/home/data/log.log"))
        pw.write(e.toString)
        pw.close()
    }
  }
}
I have attached other images from the Spark UI.

Your problem is related to the batches: you need to choose a suitable processing time for each batch, and that depends on your cluster's processing power. Also, the time to process each batch depends on whether you are receiving all the fields without nulls, because if many fields arrive as null the process will take less time to finish the batch.
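As a rough sketch (the numbers below are placeholder assumptions to be tuned against your cluster, not recommendations), you can cap how much Kafka data each micro-batch reads and give the query a trigger interval it can actually finish within:

import org.apache.spark.sql.streaming.Trigger

// cap the Kafka offsets consumed per micro-batch (value is an assumption to tune)
val stream = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "ip:9092")
  .option("subscribe", "ssp.impressions")
  .option("maxOffsetsPerTrigger", "200000")
  .load()

// ... same select/parse logic as above, then:
val query2 = query.writeStream.format("parquet")
  .option("checkpointLocation", "/home/data/impressions/checkpoint")
  .option("path", "/home/data/impressions")
  .partitionBy("year", "month", "day", "hour")
  // choose an interval long enough for one batch to complete on this cluster
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()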

Related

Histogram/Counter metrics for Prometheus in Spark Job readStream/writeStream (Kafka to parquet). How to send metric from LOOP or Event or Listener

How can I send Histogram/Counter metrics for Prometheus from a Spark job in:
a loop
foreachBatch
the methods of ForeachWriter
Spark events
or by using org.apache.spark.metrics.source.Source in a Spark job with a stream?
I'm able to accumulate metrics in collection accumulator(s), but I cannot find a context from which I can send the accumulated metrics without compilation or runtime issues.
Common issue:
22/11/28 14:24:36 ERROR MicroBatchExecution: Query [id = 5d2fc03c-1dbc-4bb1-a821-397586d22cf4, runId = e665dcd2-6e3d-4b03-8684-11844de040f0] terminated with error
org.apache.spark.SparkException: Task not serializable
or
The Spark job stops on the Spark worker about 15 seconds after starting, with different variations of the error messages.
What I have found so far:
It works in a local environment with a simple spark-submit, but it doesn't work on the cluster. The collection returned by SparkEnv.get.metricsSystem.getSourcesByName is always empty.
https://gist.github.com/ambud/641f8fc25f7f8d3923d6fd10f64b7184
I have only found questionable ways to work around this issue. I can't believe that there is no common solution.
package org.apache.spark.metrics.source

import com.codahale.metrics.{Counter, Histogram, MetricRegistry}

class PrometheusMetricSource extends Source {
  override val sourceName: String = "PrometheusMetricSource"
  override val metricRegistry: MetricRegistry = new MetricRegistry
  val myMetric: Histogram = metricRegistry.histogram(MetricRegistry.name("myMetric"))
}
import org.apache.spark.SparkEnv
import org.apache.spark.metrics.source.PrometheusMetricSource
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.{DataFrame, Dataset, ForeachWriter, SparkSession}

object Example {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("My Spark job").getOrCreate()
    import spark.implicits._

    val source: PrometheusMetricSource = new PrometheusMetricSource
    SparkEnv.get.metricsSystem.registerSource(source)

    val df: DataFrame = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my-topic")
      .option("startingOffsets", "earliest")
      .load()

    val ds: Dataset[String] =
      df.select(col("value"))
        .as[String]
        .map { str =>
          source.myMetric.update(1L) // submit metric ////////////////////////
          str + "test"
        }

    ds.writeStream
      .foreachBatch { (batchDF: Dataset[String], batchId: Long) =>
        source.myMetric.update(1L) // submit metric ////////////////////////
      }
      .foreach(new ForeachWriter[String] {
        def open(partitionId: Long, version: Long): Boolean = true
        def close(errorOrNull: Throwable): Unit = {}
        def process(record: String) = {
          source.myMetric.update(1L) // submit metric ////////////////////////
        }
      })
      .outputMode("append")
      .format("parquet")
      .option("path", "/share/parquet")
      .option("checkpointLocation", "/share/checkpoints")
      .start()
      .awaitTermination()
  }
}

Error: Using Spark Structured Streaming to read and write data to another topic in kafka

I am working on a small task of reading an access_logs file through a Kafka topic, counting the statuses, and sending the status counts to another Kafka topic.
But I keep getting errors.
When I use no output mode or append mode:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
When using complete mode:
Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: requirement failed: KafkaTable does not support Complete mode.
This is my code:
structuredStreaming.scala
package com.spark.sparkstreaming

import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
import org.apache.log4j._
import org.apache.spark.sql.functions._
import java.util.regex.Pattern
import java.util.regex.Matcher
import java.text.SimpleDateFormat
import java.util.Locale

import Utilities._

object structuredStreaming {

  case class LogEntry(ip: String, client: String, user: String, dateTime: String, request: String, status: String, bytes: String, referer: String, agent: String)

  val logPattern = apacheLogPattern()
  val datePattern = Pattern.compile("\\[(.*?) .+]")

  def parseDateField(field: String): Option[String] = {
    val dateMatcher = datePattern.matcher(field)
    if (dateMatcher.find) {
      val dateString = dateMatcher.group(1)
      val dateFormat = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH)
      val date = (dateFormat.parse(dateString))
      val timestamp = new java.sql.Timestamp(date.getTime());
      return Option(timestamp.toString())
    } else {
      None
    }
  }

  def parseLog(x: Row): Option[LogEntry] = {
    val matcher: Matcher = logPattern.matcher(x.getString(0));
    if (matcher.matches()) {
      val timeString = matcher.group(4)
      return Some(LogEntry(
        matcher.group(1),
        matcher.group(2),
        matcher.group(3),
        parseDateField(matcher.group(4)).getOrElse(""),
        matcher.group(5),
        matcher.group(6),
        matcher.group(7),
        matcher.group(8),
        matcher.group(9)
      ))
    } else {
      return None
    }
  }

  def main(args: Array[String]) {

    val spark = SparkSession
      .builder
      .appName("StructuredStreaming")
      .master("local[*]")
      .config("spark.sql.streaming.checkpointLocation", "/home/UDHAV.MAHATA/Documents/Checkpoints")
      .getOrCreate()

    setupLogging()

    // val rawData = spark.readStream.text("/home/UDHAV.MAHATA/Documents/Spark/logs")
    val rawData = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "testing")
      .load()

    import spark.implicits._

    val structuredData = rawData.flatMap(parseLog).select("status")

    val windowed = structuredData.groupBy($"status").count()

    //val query = windowed.writeStream.outputMode("complete").format("console").start()
    val query = windowed
      .writeStream
      .outputMode("complete")
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "sink")
      .start()

    query.awaitTermination()

    spark.stop()
  }
}
Utilities.scala
package com.spark.sparkstreaming

import org.apache.log4j.Level
import java.util.regex.Pattern
import java.util.regex.Matcher

object Utilities {

  def setupLogging() = {
    import org.apache.log4j.{Level, Logger}
    val rootLogger = Logger.getRootLogger()
    rootLogger.setLevel(Level.ERROR)
  }

  def apacheLogPattern(): Pattern = {
    val ddd = "\\d{1,3}"
    val ip = s"($ddd\\.$ddd\\.$ddd\\.$ddd)?"
    val client = "(\\S+)"
    val user = "(\\S+)"
    val dateTime = "(\\[.+?\\])"
    val request = "\"(.*?)\""
    val status = "(\\d{3})"
    val bytes = "(\\S+)"
    val referer = "\"(.*?)\""
    val agent = "\"(.*?)\""
    val regex = s"$ip $client $user $dateTime $request $status $bytes $referer $agent"
    Pattern.compile(regex)
  }
}
Can anyone help me figure out where I am making a mistake?
As the error message suggests, you need to add a watermark to your grouping.
Replace this line
val windowed = structuredData.groupBy($"status").count()
with
import org.apache.spark.sql.functions.{window, col}
val windowed = structuredData.groupBy(window(col("dateTime"), "10 minutes"), col("status")).count()
It is important that the dateTime column is of type timestamp, which you are parsing from the Kafka source anyway, if I understood your code correctly.
Without the window, Spark does not know how much data to aggregate.
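Note also that append mode specifically requires a watermark on the event-time column, so the window alone is not enough. A sketch of the full grouping, assuming structuredData also keeps the dateTime column (for example by selecting "dateTime" together with "status") and that it can be cast to a timestamp; the 10-minute values are arbitrary:

import org.apache.spark.sql.functions.{col, window}

// assumes structuredData contains both dateTime and status
val windowed = structuredData
  .withColumn("dateTime", col("dateTime").cast("timestamp"))
  // allow data up to 10 minutes late before a window is finalized
  .withWatermark("dateTime", "10 minutes")
  .groupBy(window(col("dateTime"), "10 minutes"), col("status"))
  .count()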

How to perform Unit testing on Spark Structured Streaming?

I would like to know about the unit testing side of Spark Structured Streaming. My scenario is that I am getting data from Kafka, consuming it with Spark Structured Streaming, and applying some transformations on top of the data.
I am not sure how to test this using Scala and Spark. Can someone tell me how to do unit testing in Structured Streaming using Scala? I am new to streaming.
tl;dr Use MemoryStream to add events and memory sink for the output.
The following code should help to get started:
import org.apache.spark.sql.execution.streaming.MemoryStream

implicit val sqlCtx = spark.sqlContext
import spark.implicits._

val events = MemoryStream[Event]
val sessions = events.toDS
assert(sessions.isStreaming, "sessions must be a streaming Dataset")

// use sessions event stream to apply required transformations
val transformedSessions = ...

val streamingQuery = transformedSessions
  .writeStream
  .format("memory")
  .queryName(queryName)
  .option("checkpointLocation", checkpointLocation)
  .outputMode(queryOutputMode)
  .start

// Add events to MemoryStream as if they came from Kafka
val batch = Seq(
  eventGen.generate(userId = 1, offset = 1.second),
  eventGen.generate(userId = 2, offset = 2.seconds))
val currentOffset = events.addData(batch)
streamingQuery.processAllAvailable()
events.commit(currentOffset.asInstanceOf[LongOffset])

// check the output
// The output is in queryName table
// The following code simply shows the result
spark
  .table(queryName)
  .show(truncate = false)
So, I tried to implement the answer from @Jacek, but I couldn't figure out how to create the eventGen object, nor how to test a small streaming application that writes data to the console. I am also using MemoryStream, and here I show a small working example.
The class that I am testing is:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.StreamingQuery
import org.apache.spark.sql.{DataFrame, SparkSession, functions}

object StreamingDataFrames {

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName(StreamingDataFrames.getClass.getSimpleName)
      .master("local[2]")
      .getOrCreate()
    val lines = readData(spark, "socket")
    val streamingQuery = writeData(lines)
    streamingQuery.awaitTermination()
  }

  def readData(spark: SparkSession, source: String = "socket"): DataFrame = {
    val lines: DataFrame = spark.readStream
      .format(source)
      .option("host", "localhost")
      .option("port", 12345)
      .load()
    lines
  }

  def writeData(df: DataFrame, sink: String = "console", queryName: String = "calleventaggs", outputMode: String = "append"): StreamingQuery = {
    println(s"Is this a streaming data frame: ${df.isStreaming}")
    val shortLines: DataFrame = df.filter(functions.length(col("value")) >= 3)
    val query = shortLines.writeStream
      .format(sink)
      .queryName(queryName)
      .outputMode(outputMode)
      .start()
    query
  }
}
I only test the writeData method. That is why I split the query into two methods.
Then here is the Spec to test the class. I use a SharedSparkSession class to facilitate opening and closing the Spark context, like the one shown here; a minimal sketch of such a trait follows after the Spec.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.{LongOffset, MemoryStream}
import org.github.explore.spark.SharedSparkSession
import org.scalatest.funsuite.AnyFunSuite

class StreamingDataFramesSpec extends AnyFunSuite with SharedSparkSession {

  test("spark structured streaming can read from memory socket") {
    // We can import sql implicits
    implicit val sqlCtx = sparkSession.sqlContext
    import sqlImplicits._

    val events = MemoryStream[String]
    val queryName: String = "calleventaggs"

    // Add events to MemoryStream as if they came from Kafka
    val batch = Seq(
      "this is a value to read",
      "and this is another value"
    )
    val currentOffset = events.addData(batch)

    val streamingQuery = StreamingDataFrames.writeData(events.toDF(), "memory", queryName)
    streamingQuery.processAllAvailable()
    events.commit(currentOffset.asInstanceOf[LongOffset])

    val result: DataFrame = sparkSession.table(queryName)
    result.show

    streamingQuery.awaitTermination(1000L)

    assertResult(batch.size)(result.count)
    val values = result.take(2)
    assertResult(batch(0))(values(0).getString(0))
    assertResult(batch(1))(values(1).getString(0))
  }
}
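Since the linked SharedSparkSession is not reproduced here, the following is only a minimal sketch of what such a trait could look like, assuming ScalaTest's BeforeAndAfterAll; the real helper behind the link may differ:

import org.apache.spark.sql.{SQLContext, SQLImplicits, SparkSession}
import org.scalatest.{BeforeAndAfterAll, Suite}

trait SharedSparkSession extends BeforeAndAfterAll { self: Suite =>

  // one local session shared by all tests in the suite
  lazy val sparkSession: SparkSession = SparkSession.builder()
    .appName("spec")
    .master("local[2]")
    .getOrCreate()

  // implicits bound to this session, so specs can `import sqlImplicits._`
  object sqlImplicits extends SQLImplicits {
    protected override def _sqlContext: SQLContext = sparkSession.sqlContext
  }

  override def afterAll(): Unit = {
    try sparkSession.stop()
    finally super.afterAll()
  }
}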

How do I stream data to Neo4j using Spark

I am trying to write streaming data to Neo4j using Spark and am having some problems (I am very new to Spark).
I have tried setting up a stream of word counts and can write this to Postgres using a custom ForeachWriter as in the example here. So I think that I understand the basic flow.
I have then tried to replicate this and send the data to Neo4j instead, using the neo4j-spark-connector. I am able to send data to Neo4j using the example in the Zeppelin notebook here. So I've tried to transfer this code across to the ForeachWriter, but I've hit a problem: the sparkContext is not available in the ForeachWriter, and from what I have read it shouldn't be passed in, because it runs on the driver while the foreach code runs on the executors. Can anyone help with what I should do in this situation?
Sink.scala:
val spark = SparkSession
  .builder()
  .appName("Neo4jSparkConnector")
  .config("spark.neo4j.bolt.url", "bolt://hdp1:7687")
  .config("spark.neo4j.bolt.password", "pw")
  .getOrCreate()

import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val words = lines.as[String].flatMap(_.split(" "))

val wordCounts = words.groupBy("value").count()

wordCounts.printSchema()

val writer = new Neo4jSink()

import org.apache.spark.sql.streaming.ProcessingTime

val query = wordCounts
  .writeStream
  .foreach(writer)
  .outputMode("append")
  .trigger(ProcessingTime("25 seconds"))
  .start()

query.awaitTermination()
Neo4jSink.scala:
class Neo4jSink() extends ForeachWriter[Row] {

  def open(partitionId: Long, version: Long): Boolean = {
    true
  }

  def process(value: Row): Unit = {
    val word = ("Word", Seq("value"))
    val word_count = ("WORD_COUNT", Seq.empty)
    val count = ("Count", Seq("count"))
    Neo4jDataFrame.mergeEdgeList(sparkContext, value, word, word_count, count)
  }

  def close(errorOrNull: Throwable): Unit = {
  }
}
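One possible direction, offered only as a hedged sketch rather than a definitive answer: instead of calling the sparkContext-based Neo4jDataFrame.mergeEdgeList inside the writer, open a plain Bolt connection per partition in open(). This assumes the Neo4j Java driver (org.neo4j.driver.v1) is on the classpath, and the URL, credentials and Cypher below are placeholders:

import org.apache.spark.sql.{ForeachWriter, Row}
import org.neo4j.driver.v1.{AuthTokens, Driver, GraphDatabase, Session}
import org.neo4j.driver.v1.Values.parameters

class Neo4jDriverSink extends ForeachWriter[Row] {

  @transient private var driver: Driver = _
  @transient private var session: Session = _

  def open(partitionId: Long, version: Long): Boolean = {
    // the connection is created on the executor, so no sparkContext is needed
    driver = GraphDatabase.driver("bolt://hdp1:7687", AuthTokens.basic("neo4j", "pw"))
    session = driver.session()
    true
  }

  def process(value: Row): Unit = {
    // MERGE one (:Word)-[:WORD_COUNT]->(:Count) pair per row of the word counts
    session.run(
      "MERGE (w:Word {value: $value}) MERGE (c:Count {count: $count}) MERGE (w)-[:WORD_COUNT]->(c)",
      parameters("value", value.getString(0), "count", java.lang.Long.valueOf(value.getLong(1))))
  }

  def close(errorOrNull: Throwable): Unit = {
    if (session != null) session.close()
    if (driver != null) driver.close()
  }
}

The writer would then be passed to .foreach(...) exactly like the existing Neo4jSink.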

Spark Structured Streaming: console sink is not working as expected

I have the following code to read and process Kafka data using Structured Streaming
object ETLTest {

  case class record(value: String, topic: String)

  def main(args: Array[String]): Unit = {
    run();
  }

  def run(): Unit = {

    val spark = SparkSession
      .builder
      .appName("Test JOB")
      .master("local[*]")
      .getOrCreate()

    val kafkaStreamingDF = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "...")
      .option("subscribe", "...")
      .option("failOnDataLoss", "false")
      .option("startingOffsets","earliest")
      .load()
      .selectExpr("CAST(value as STRING)", "CAST(timestamp as STRING)","CAST(topic as STRING)")

    val sdvWriter = new ForeachWriter[record] {
      def open(partitionId: Long, version: Long): Boolean = {
        true
      }
      def process(record: record) = {
        println("record:: " + record)
      }
      def close(errorOrNull: Throwable): Unit = {}
    }

    val sdvDF = kafkaStreamingDF
      .as[record]
      .filter($"value".isNotNull)

    // DOES NOT WORK
    /*val query = sdvDF
      .writeStream
      .format("console")
      .start()
      .awaitTermination()*/

    // WORKS
    /*val query = sdvDF
      .writeStream
      .foreach(sdvWriter)
      .start()
      .awaitTermination()
    */
  }
}
I am running this code from the IntelliJ IDEA IDE. When I use foreach(sdvWriter), I can see the records consumed from Kafka, but when I use .writeStream.format("console") I do not see any records. I assume that the console write stream is maintaining some sort of checkpoint and assumes it has processed all the records. Is that the case? Am I missing something obvious here?
I reproduced your code here and both of the options worked. Actually, without import spark.implicits._ both options would fail, so I'm not sure what you are missing. It might be some incorrectly configured dependencies. Can you add the pom.xml?
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

object Check {

  case class record(value: String, topic: String)

  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder().master("local[2]")
      .getOrCreate

    import spark.implicits._

    val kafkaStreamingDF = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "test")
      .option("startingOffsets","earliest")
      .option("failOnDataLoss", "false")
      .load()
      .selectExpr("CAST(value as STRING)", "CAST(timestamp as STRING)","CAST(topic as STRING)")

    val sdvDF = kafkaStreamingDF
      .as[record]
      .filter($"value".isNotNull)

    val query = sdvDF.writeStream
      .format("console")
      .option("truncate","false")
      .start()
      .awaitTermination()
  }
}