How do I stream data to Neo4j using Spark - scala

I am trying to write streaming data to Neo4j using Spark and am having some problems (I am very new to Spark).
I have tried setting up a stream of word counts and can write it to Postgres using a custom ForeachWriter, as in the example here. So I think I understand the basic flow.
I have then tried to replicate this and send the data to Neo4j instead, using the neo4j-spark-connector. I am able to send data to Neo4j using the example in the Zeppelin notebook here. So I've tried to transfer this code across to the ForeachWriter, but I've hit a problem: the sparkContext is not available in the ForeachWriter, and from what I have read it shouldn't be passed in, because it lives on the driver while the foreach code runs on the executors. Can anyone help with what I should do in this situation?
Sink.scala:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Neo4jSparkConnector")
  .config("spark.neo4j.bolt.url", "bolt://hdp1:7687")
  .config("spark.neo4j.bolt.password", "pw")
  .getOrCreate()

import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()
wordCounts.printSchema()

val writer = new Neo4jSink()

import org.apache.spark.sql.streaming.ProcessingTime

val query = wordCounts
  .writeStream
  .foreach(writer)
  .outputMode("append")
  .trigger(ProcessingTime("25 seconds"))
  .start()

query.awaitTermination()
Neo4jSink.scala:
import org.apache.spark.sql.{ForeachWriter, Row}
import org.neo4j.spark.Neo4jDataFrame

class Neo4jSink() extends ForeachWriter[Row] {

  def open(partitionId: Long, version: Long): Boolean = {
    true
  }

  def process(value: Row): Unit = {
    val word = ("Word", Seq("value"))
    val word_count = ("WORD_COUNT", Seq.empty)
    val count = ("Count", Seq("count"))
    // sparkContext is not in scope here -- this is the problem described above
    Neo4jDataFrame.mergeEdgeList(sparkContext, value, word, word_count, count)
  }

  def close(errorOrNull: Throwable): Unit = {
  }
}
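The sparkContext (and anything built on it, such as Neo4jDataFrame.mergeEdgeList) exists only on the driver, so it cannot be used inside a ForeachWriter, which is serialized to the executors. One common workaround is to talk to Neo4j directly from the writer: open a Bolt session per partition in open() and issue the Cypher MERGE yourself in process(). The sketch below assumes the official neo4j-java-driver (1.x) is on the classpath; the Bolt URL and password are taken from the question, the "neo4j" user and the graph model (Word/Count nodes joined by WORD_COUNT) are illustrative assumptions.

import org.apache.spark.sql.{ForeachWriter, Row}
import org.neo4j.driver.v1.{AuthTokens, Driver, GraphDatabase, Session, Values}

// Sketch of a ForeachWriter that talks to Neo4j directly from the executors,
// instead of relying on the driver-side sparkContext.
class Neo4jBoltSink extends ForeachWriter[Row] {

  // Created in open(), i.e. on the executor that processes the partition.
  private var driver: Driver = _
  private var session: Session = _

  override def open(partitionId: Long, version: Long): Boolean = {
    driver = GraphDatabase.driver("bolt://hdp1:7687", AuthTokens.basic("neo4j", "pw"))
    session = driver.session()
    true
  }

  override def process(value: Row): Unit = {
    val word  = value.getAs[String]("value")
    val count = value.getAs[Long]("count")
    // Hypothetical graph model: (:Word)-[:WORD_COUNT]->(:Count)
    session.run(
      "MERGE (w:Word {value: $word}) " +
      "MERGE (c:Count {count: $count}) " +
      "MERGE (w)-[:WORD_COUNT]->(c)",
      Values.parameters("word", word, "count", Long.box(count)))
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (session != null) session.close()
    if (driver != null) driver.close()
  }
}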

Related

Histogram/Counter metrics for Prometheus in Spark Job readStream/writeStream (Kafka to parquet). How to send metric from LOOP or Event or Listener

How can I send Histogram/Counter metrics for Prometheus from a Spark job in:
a loop
foreachBatch
the methods of a ForeachWriter
Spark events
using org.apache.spark.metrics.source.Source in a Spark job with a stream?
I'm able to accumulate metrics in collection accumulator(s), but I cannot find a context from which I can send the accumulated metrics without compilation or execution issues.
Common issue:
22/11/28 14:24:36 ERROR MicroBatchExecution: Query [id = 5d2fc03c-1dbc-4bb1-a821-397586d22cf4, runId = e665dcd2-6e3d-4b03-8684-11844de040f0] terminated with error
org.apache.spark.SparkException: Task not serializable
or the Spark job stops on the Spark worker about 15 seconds after starting, with different variations of the error messages.
A solution I found (https://gist.github.com/ambud/641f8fc25f7f8d3923d6fd10f64b7184) works in a local environment with a simple spark-submit, but it doesn't work on the cluster: the collection returned by SparkEnv.get.metricsSystem.getSourcesByName is always empty.
I see only dubious ways to fix this issue, and I don't believe there is no common solution.
package org.apache.spark.metrics.source

import com.codahale.metrics.{Counter, Histogram, MetricRegistry}

class PrometheusMetricSource extends Source {
  override val sourceName: String = "PrometheusMetricSource"
  override val metricRegistry: MetricRegistry = new MetricRegistry
  val myMetric: Histogram = metricRegistry.histogram(MetricRegistry.name("myMetric"))
}

import org.apache.spark.SparkEnv
import org.apache.spark.metrics.source.PrometheusMetricSource
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.{DataFrame, Dataset, ForeachWriter, SparkSession}

object Example {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("My Spark job").getOrCreate()
    import spark.implicits._

    val source: PrometheusMetricSource = new PrometheusMetricSource
    SparkEnv.get.metricsSystem.registerSource(source)

    val df: DataFrame = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my-topic")
      .option("startingOffsets", "earliest")
      .load()

    val ds: Dataset[String] =
      df.select(col("value"))
        .as[String]
        .map { str =>
          source.myMetric.update(1L) // submit metric ////////////////////////
          str + "test"
        }

    ds.writeStream
      .foreachBatch {
        (batchDF: Dataset[String], batchId: Long) =>
          source.myMetric.update(1L) // submit metric ////////////////////////
      }
      .foreach(new ForeachWriter[String] {
        def open(partitionId: Long, version: Long): Boolean = true
        def close(errorOrNull: Throwable): Unit = {}
        def process(record: String) = {
          source.myMetric.update(1L) // submit metric ////////////////////////
        }
      })
      .outputMode("append")
      .format("parquet")
      .option("path", "/share/parquet")
      .option("checkpointLocation", "/share/checkpoints")
      .start()
      .awaitTermination()
  }
}
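Not from the original post, but one context that avoids the Task not serializable error entirely is the driver itself: a StreamingQueryListener receives a progress callback after each micro-batch, so the driver-registered source from the snippet above can be updated there. This only covers driver-side aggregates such as numInputRows from the progress report, not per-record values computed on executors. A minimal sketch, reusing the spark and source values defined above:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Runs on the driver only, so nothing here needs to be serialized to executors.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // One update per micro-batch, e.g. with the number of input rows of that batch.
    source.myMetric.update(event.progress.numInputRows)
  }
})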

How to perform Unit testing on Spark Structured Streaming?

I would like to know about the unit testing side of Spark Structured Streaming. My scenario is: I am getting data from Kafka, consuming it with Spark Structured Streaming, and applying some transformations on top of the data.
I am not sure how I can test this using Scala and Spark. Can someone tell me how to do unit testing in Structured Streaming using Scala? I am new to streaming.
tl;dr Use MemoryStream to add events and a memory sink for the output.
The following code should help to get started:
import org.apache.spark.sql.execution.streaming.{LongOffset, MemoryStream}
import scala.concurrent.duration._

implicit val sqlCtx = spark.sqlContext
import spark.implicits._

val events = MemoryStream[Event]
val sessions = events.toDS
assert(sessions.isStreaming, "sessions must be a streaming Dataset")

// use sessions event stream to apply required transformations
val transformedSessions = ...

val streamingQuery = transformedSessions
  .writeStream
  .format("memory")
  .queryName(queryName)
  .option("checkpointLocation", checkpointLocation)
  .outputMode(queryOutputMode)
  .start

// Add events to MemoryStream as if they came from Kafka
val batch = Seq(
  eventGen.generate(userId = 1, offset = 1.second),
  eventGen.generate(userId = 2, offset = 2.seconds))
val currentOffset = events.addData(batch)
streamingQuery.processAllAvailable()
events.commit(currentOffset.asInstanceOf[LongOffset])

// check the output
// The output is in the queryName table
// The following code simply shows the result
spark
  .table(queryName)
  .show(truncate = false)
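Event and eventGen are not defined in the answer (which is what the follow-up below ran into). A hypothetical stand-in, just to make the snippet compile, could look like this; the field names and the generator are purely illustrative:

import java.sql.Timestamp
import scala.concurrent.duration.FiniteDuration

// Hypothetical event type for the MemoryStream[Event] above.
case class Event(userId: Int, time: Timestamp)

// Hypothetical generator: produces an Event offset from some fixed start time.
object eventGen {
  private val start = System.currentTimeMillis()
  def generate(userId: Int, offset: FiniteDuration): Event =
    Event(userId, new Timestamp(start + offset.toMillis))
}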
So, I tried to implement the answer from @Jacek, but I couldn't find how to create the eventGen object, and I also wanted to test a small streaming application that writes data to the console. I am also using MemoryStream, and here I show a small working example.
The class that I am testing is:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.StreamingQuery
import org.apache.spark.sql.{DataFrame, SparkSession, functions}

object StreamingDataFrames {

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName(StreamingDataFrames.getClass.getSimpleName)
      .master("local[2]")
      .getOrCreate()
    val lines = readData(spark, "socket")
    val streamingQuery = writeData(lines)
    streamingQuery.awaitTermination()
  }

  def readData(spark: SparkSession, source: String = "socket"): DataFrame = {
    val lines: DataFrame = spark.readStream
      .format(source)
      .option("host", "localhost")
      .option("port", 12345)
      .load()
    lines
  }

  def writeData(df: DataFrame, sink: String = "console", queryName: String = "calleventaggs", outputMode: String = "append"): StreamingQuery = {
    println(s"Is this a streaming data frame: ${df.isStreaming}")
    val shortLines: DataFrame = df.filter(functions.length(col("value")) >= 3)
    val query = shortLines.writeStream
      .format(sink)
      .queryName(queryName)
      .outputMode(outputMode)
      .start()
    query
  }
}
I test only the writeData method. This is why I split the query into two methods.
Here is the Spec to test the class. I use a SharedSparkSession class to facilitate opening and closing the Spark context, like it is shown here.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.{LongOffset, MemoryStream}
import org.github.explore.spark.SharedSparkSession
import org.scalatest.funsuite.AnyFunSuite

class StreamingDataFramesSpec extends AnyFunSuite with SharedSparkSession {

  test("spark structured streaming can read from memory socket") {
    // We can import sql implicits
    implicit val sqlCtx = sparkSession.sqlContext
    import sqlImplicits._

    val events = MemoryStream[String]
    val queryName: String = "calleventaggs"

    // Add events to MemoryStream as if they came from Kafka
    val batch = Seq(
      "this is a value to read",
      "and this is another value"
    )
    val currentOffset = events.addData(batch)

    val streamingQuery = StreamingDataFrames.writeData(events.toDF(), "memory", queryName)
    streamingQuery.processAllAvailable()
    events.commit(currentOffset.asInstanceOf[LongOffset])

    val result: DataFrame = sparkSession.table(queryName)
    result.show

    streamingQuery.awaitTermination(1000L)
    assertResult(batch.size)(result.count)

    val values = result.take(2)
    assertResult(batch(0))(values(0).getString(0))
    assertResult(batch(1))(values(1).getString(0))
  }
}
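The SharedSparkSession trait itself is not shown in the post. A minimal hypothetical version, assuming ScalaTest's BeforeAndAfterAll, that would satisfy the sparkSession and sqlImplicits references used in the spec could look like this (package and names are assumptions):

import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, Suite}

// Hypothetical SharedSparkSession: the post links to an external version that is not shown.
trait SharedSparkSession extends BeforeAndAfterAll { this: Suite =>

  lazy val sparkSession: SparkSession = SparkSession.builder()
    .appName("test")
    .master("local[2]")
    .getOrCreate()

  // Exposed so tests can `import sqlImplicits._` as in the spec above.
  lazy val sqlImplicits = sparkSession.implicits

  override def afterAll(): Unit = {
    sparkSession.stop()
    super.afterAll()
  }
}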

Each query takes more time using Structured Streaming with Spark

I'm using Spark 2.3.0, Scala 2.11.8 and Kafka, and I'm trying to write all the messages from Kafka into parquet files with Structured Streaming, but with each query my implementation runs the total time per query increases a lot (see the Spark stages screenshot).
I would like to know why this happens. I tried different possible triggers (continuous, 0 seconds, 1 second, 10 seconds, 10 minutes, etc.) and I always get the same behavior. My code has this structure:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Column, SparkSession}
import com.name.proto.ProtoMessages
import java.io._
import java.text.{DateFormat, SimpleDateFormat}
import java.util.Date
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.streaming.OutputMode

object StructuredStreaming {

  def message_proto(value: Array[Byte]): Map[String, String] = {
    try {
      val dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
      val impression_proto = ProtoMessages.TrackingRequest.parseFrom(value)
      val json = Map(
        "id_req" -> (impression_proto.getIdReq().toString),
        "ts_imp_request" -> (impression_proto.getTsRequest().toString),
        "is_after" -> (impression_proto.getIsAfter().toString),
        "type" -> (impression_proto.getType().toString)
      )
      return json
    } catch {
      case e: Exception =>
        val pw = new PrintWriter(new File("/home/data/log.log"))
        pw.write(e.toString)
        pw.close()
        return Map("error" -> "error")
    }
  }

  def main(args: Array[String]) {
    val proto_impressions_udf = udf(message_proto _)
    val spark = SparkSession.builder.appName("Structured Streaming ").getOrCreate()

    //fetchOffset.numRetries, fetchOffset.retryIntervalMs
    val stream = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "ip:9092")
      .option("subscribe", "ssp.impressions")
      .option("startingOffsets", "latest")
      .option("max.poll.records", "1000000")
      .option("auto.commit.interval.ms", "100000")
      .option("session.timeout.ms", "10000")
      .option("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      .option("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
      .option("failOnDataLoss", "false")
      .option("latestFirst", "true")
      .load()

    try {
      val query = stream.select(col("value").cast("string"))
        .select(proto_impressions_udf(col("value")) as "value_udf")
        .select(col("value_udf")("id_req").as("id_req"),
          col("value_udf")("is_after").as("is_after"),
          date_format(col("value_udf")("ts_request"), "yyyy").as("date").as("year"),
          date_format(col("value_udf")("ts_request"), "MM").as("date").as("month"),
          date_format(col("value_udf")("ts_request"), "dd").as("date").as("day"),
          date_format(col("value_udf")("ts_request"), "HH").as("date").as("hour"))

      val query2 = query.writeStream.format("parquet")
        .option("checkpointLocation", "/home/data/impressions/checkpoint")
        .option("path", "/home/data/impressions")
        .outputMode(OutputMode.Append())
        .partitionBy("year", "month", "day", "hour")
        .trigger(Trigger.ProcessingTime("1 seconds"))
        .start()
    } catch {
      case e: Exception =>
        val pw = new PrintWriter(new File("/home/data/log.log"))
        pw.write(e.toString)
        pw.close()
    }
  }
}
I attached other images from the Spark UI.
Your problem is related to the batches: you need to define a good trigger interval for processing each batch, and that depends on your cluster's processing power. Also, the time to finish each batch depends on whether you are receiving all the fields without nulls, because if many fields arrive as null the batch takes less time to process.
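To pick that trigger interval, it helps to look at how long each micro-batch actually takes. A rough sketch, not from the original post: the query returned by start() (query2 in the code above) exposes progress reports that break each batch's duration down by phase; the 30-second trigger below is illustrative.

import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}

// Sketch: inspect how long each micro-batch takes, then choose a
// ProcessingTime trigger that is comfortably larger than that.
def logBatchDurations(query: StreamingQuery): Unit = {
  val progress = query.lastProgress            // most recent micro-batch, may be null early on
  if (progress != null) {
    println(s"batch ${progress.batchId}: ${progress.durationMs}")   // per-phase durations in ms
    println(s"inputRowsPerSecond = ${progress.inputRowsPerSecond}")
    println(s"processedRowsPerSecond = ${progress.processedRowsPerSecond}")
  }
}

// Illustrative trigger: give each batch more time than the observed processing time.
// .trigger(Trigger.ProcessingTime("30 seconds"))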

spark dataframe write to file using scala

I am trying to read a file and add two extra columns: 1. a sequence number and 2. the filename.
When I run the Spark job in the Scala IDE the output is generated correctly, but when I run it in PuTTY in local or cluster mode the job gets stuck at stage 2 (save at File_Process). There is no progress even if I wait for an hour. I am testing on 1 GB of data.
Below is the code I am using:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object File_Process {

  Logger.getLogger("org").setLevel(Level.ERROR)

  val spark = SparkSession
    .builder()
    .master("yarn")
    .appName("File_Process")
    .getOrCreate()

  def main(arg: Array[String]) {
    val FileDF = spark.read
      .csv("/data/sourcefile/")
    // SEED and filename are not defined in the snippet
    val rdd = FileDF.rdd.zipWithIndex().map(indexedRow => Row.fromSeq((indexedRow._2.toLong + SEED + 1) +: indexedRow._1.toSeq))
    val FileDFWithSeqNo = StructType(Array(StructField("UniqueRowIdentifier", LongType)).++(FileDF.schema.fields))
    val datasetnew = spark.createDataFrame(rdd, FileDFWithSeqNo)
    val dataframefinal = datasetnew.withColumn("Filetag", lit(filename))
    val query = dataframefinal.write
      .mode("overwrite")
      .format("com.databricks.spark.csv")
      .option("delimiter", "|")
      .save("/data/text_file/")
    spark.stop()
  }
}
If I remove the logic that adds the seq no, the code works fine.
The code for creating the seq no is:
val rdd = FileDF.rdd.zipWithIndex().map(indexedRow => Row.fromSeq((indexedRow._2.toLong + SEED + 1) +: indexedRow._1.toSeq))
val FileDFWithSeqNo = StructType(Array(StructField("UniqueRowIdentifier", LongType)).++(FileDF.schema.fields))
val datasetnew = spark.createDataFrame(rdd, FileDFWithSeqNo)
Thanks in advance.
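Not part of the original post, but one commonly suggested alternative when zipWithIndex over a large file stalls is to generate the identifier with Spark SQL functions instead of dropping to the RDD API: monotonically_increasing_id gives unique but non-consecutive IDs without a shuffle, and a row_number over a window gives consecutive ones at the cost of pulling the data through a single partition. A rough sketch; the output path and the use of input_file_name for the file tag are illustrative assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{input_file_name, monotonically_increasing_id, row_number}

object FileProcessAlt {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("File_Process_Alt").getOrCreate()

    val fileDF = spark.read.csv("/data/sourcefile/")

    // Unique but non-consecutive identifier, computed without a shuffle.
    val withId = fileDF.withColumn("UniqueRowIdentifier", monotonically_increasing_id())

    // If consecutive numbers are required, a window over the generated id can be used
    // (this forces the data through a single partition, so use with care).
    val consecutive = withId.withColumn(
      "UniqueRowIdentifier",
      row_number().over(Window.orderBy("UniqueRowIdentifier")).cast("long"))

    // input_file_name() records which file each row came from.
    val tagged = consecutive.withColumn("Filetag", input_file_name())

    tagged.write
      .mode("overwrite")
      .option("delimiter", "|")
      .csv("/data/text_file_alt/")
  }
}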

Saving DataStream data into MongoDB / converting DS to DF

I am able to save a DataFrame to MongoDB, but my Spark Streaming program gives a DStream (kafkaStream) and I am not able to save it in MongoDB, nor am I able to convert this DStream to a DataFrame. Is there any library or method to do this? Any inputs are highly appreciated.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaSparkStream {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[*]").setAppName("KafkaReceiver")
    val ssc = new StreamingContext(conf, Seconds(10))
    val kafkaStream = KafkaUtils.createStream(ssc,
      "localhost:2181", "spark-streaming-consumer-group", Map("topic" -> 25))
    kafkaStream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Save a DF to mongodb - SUCCESS
val mongoDbFormat = "com.stratio.datasource.mongodb"
val mongoDbDatabase = "mongodatabase"
val mongoDbCollection = "mongodf"

val mongoDbOptions = Map(
  MongodbConfig.Host -> "localhost:27017",
  MongodbConfig.Database -> mongoDbDatabase,
  MongodbConfig.Collection -> mongoDbCollection
)

//with DataFrame methods
dataFrame.write
  .format(mongoDbFormat)
  .mode(SaveMode.Append)
  .options(mongoDbOptions)
  .save()
Access the underlying RDDs of the DStream using foreachRDD, transform each one to a DataFrame, and use your DataFrame write logic on it.
The easiest way to transform an RDD into a DataFrame is to first give the data a schema, represented in Scala by a case class:
case class Element(...)

val elementDStream = kafkaDStream.map(entry => Element(entry, ...))
elementDStream.foreachRDD { rdd =>
  val df = rdd.toDF
  df.write(...)
}
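Filled out a bit, and reusing the Stratio format and options (mongoDbFormat, mongoDbOptions) from the working batch example above, the pattern could look roughly like this; the Element fields and the Kafka key/value parsing are illustrative, and toDF needs the SQL implicits in scope inside foreachRDD:

import org.apache.spark.sql.{SaveMode, SparkSession}

// Illustrative event type; the real fields depend on the Kafka payload.
case class Element(key: String, value: String)

kafkaStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // Obtain (or reuse) a SparkSession on the driver for this batch.
    val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
    import spark.implicits._

    // kafkaStream from createStream yields (key, message) pairs.
    val df = rdd.map { case (k, v) => Element(k, v) }.toDF()

    df.write
      .format(mongoDbFormat)          // "com.stratio.datasource.mongodb", as above
      .mode(SaveMode.Append)
      .options(mongoDbOptions)        // host/database/collection map, as above
      .save()
  }
}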
Also, watch out for Spark 2.0, where this process will change completely with the introduction of Structured Streaming, in which a MongoDB connection will become a sink.
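For completeness: in Structured Streaming (Spark 2.4+), the same batch writer can be reused per micro-batch with foreachBatch. A hedged sketch, again reusing the options from above; streamingDF stands for whatever streaming DataFrame the Kafka source produces, and the checkpoint path is illustrative:

import org.apache.spark.sql.{DataFrame, SaveMode}

// Typed function value avoids the foreachBatch overload ambiguity in Scala 2.12.
val writeBatchToMongo: (DataFrame, Long) => Unit = (batchDF, _) =>
  batchDF.write
    .format(mongoDbFormat)            // same Stratio format as the batch example
    .mode(SaveMode.Append)
    .options(mongoDbOptions)
    .save()

streamingDF.writeStream
  .foreachBatch(writeBatchToMongo)
  .option("checkpointLocation", "/tmp/mongo-checkpoint")  // illustrative path
  .start()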