NotSerializableException with Neo4j Spark Streaming Scala

I am trying to run a query in Neo4j using the Neo4j-Spark connector. I want to pass values from the stream (produced by Kafka as a String) into my query. However, I get a serialization exception:
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@54688d9f)
- field (class: consumer.SparkConsumer$$anonfun$processingLogic$2, name: sc$1, type: class org.apache.spark.SparkContext)
- object (class consumer.SparkConsumer$$anonfun$processingLogic$2, <function1>)
- field (class: consumer.SparkConsumer$$anonfun$processingLogic$2$$anonfun$apply$3, name: $outer, type: class consumer.SparkConsumer$$anonfun$processingLogic$2)
- object (class consumer.SparkConsumer$$anonfun$processingLogic$2$$anonfun$apply$3, <function1>)
Here is the code for the main function and querying logic:
object SparkConsumer {

  def main(args: Array[String]) {
    val config = "neo4j_local"
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("KafkaSparkStreaming")
    setNeo4jSparkConfig(config, sparkConf)

    val sparkSession = SparkSession
      .builder()
      .config(sparkConf)
      .getOrCreate()

    val streamingContext = new StreamingContext(sparkSession.sparkContext, Seconds(3))
    streamingContext.sparkContext.setLogLevel("ERROR")
    val sqlContext = new SQLContext(streamingContext.sparkContext)
    val numStreams = 2
    val topics = Array("member_topic1")

    def kafkaParams(i: Int) = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "group2",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val lines = (1 to numStreams).map(i => KafkaUtils.createDirectStream[String, String](
      streamingContext,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams(i))
    ))

    val messages = streamingContext.union(lines)
    // extract the String values of the Kafka records before splitting
    val values = messages.map(_.value())
    val wordsArrays = values.map(_.split(","))

    wordsArrays.foreachRDD(rdd => rdd.foreach(
      data => execNeo4jSearchQuery(data)(streamingContext.sparkContext)
    ))

    streamingContext.start()
    streamingContext.awaitTermination()
  }

  def execNeo4jSearchQuery(data: Array[String])(implicit sc: SparkContext) = {
    val neo = Neo4j(sc)
    val query = "my query"
    val paramsMap = Map("lat" -> data(1).toDouble, "lon" -> data(2).toDouble, "id" -> data(0).toInt)
    val df = neo.cypher(query, paramsMap).loadDataFrame(
      "group_name" -> "string", "event_name" -> "string", "venue_name" -> "string", "distance" -> "double")
    println("\ndf:")
    df.show()
  }
}

It is not allowed to access the SparkContext or SparkSession, or to create distributed data structures, from an executor. Therefore:
wordsArrays.foreachRDD(rdd => rdd.foreach(
data => execNeo4jSearchQuery(data)(streamingContext.sparkContext)
))
where execNeo4jSearchQuery calls:
neo.cypher(query, paramsMap).loadDataFrame
is not valid Spark code.
If you want to access Neo4j directly from RDD.foreach you have to use a standard client (AnormCypher seems to provide a very elegant API), without converting the results into Spark distributed structures.
A somewhat unrelated note: you might consider using a single connection for each set of records with foreachPartition, as sketched below (see also SPARK Cost of Initalizing Database Connection in map / mapPartitions context).
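For illustration only (this is not the asker's code), here is a minimal sketch of that approach using the official Neo4j Java driver instead of the Spark connector. The Bolt URI, the credentials and the "my query" placeholder are assumptions, and the import path below is for the 1.x Java driver (newer driver versions use org.neo4j.driver instead):

import org.neo4j.driver.v1.{AuthTokens, GraphDatabase}
import scala.collection.JavaConverters._

wordsArrays.foreachRDD(rdd => rdd.foreachPartition { partition =>
  // one driver/session per partition instead of per record
  val driver = GraphDatabase.driver("bolt://localhost:7687", AuthTokens.basic("neo4j", "password"))
  val session = driver.session()
  try {
    partition.foreach { data =>
      val params = Map[String, AnyRef](
        "id"  -> Int.box(data(0).toInt),
        "lat" -> Double.box(data(1).toDouble),
        "lon" -> Double.box(data(2).toDouble)
      ).asJava
      // run the Cypher query directly on the executor, without building a DataFrame
      session.run("my query", params).asScala.foreach(record => println(record.asMap()))
    }
  } finally {
    session.close()
    driver.close()
  }
})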

Related

convert Spark RDD[Row] to DataFrame

I am having trouble transforming JSON into a DataFrame.
I am trying to use Spark in a project that synchronizes tables into a data lake (Hudi) from a CDC tool (Canal) listening to the MySQL binlog. I receive JSON describing row changes and add some fields to it. This JSON stream contains multiple schemas: each schema has different columns and may gain new columns in the future, so I build a GenericRowWithSchema for each JSON object and pass an individual schema for each row.
Now I need to transform the RDD[Row] into a DataFrame in order to write it to Hudi. How can I do that?
object code {
  def main(args: Array[String]): Unit = {
    val sss = SparkSession.builder().appName("SparkHudi").getOrCreate()
    //val sc = SparkContext.getOrCreate
    val sc = sss.sparkContext
    val ssc = new StreamingContext(sc, Seconds(1))
    //ssc.sparkContext.setLogLevel("INFO");

    import org.apache.kafka.common.serialization.StringDeserializer
    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "kafka.test.com:9092",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.GROUP_ID_CONFIG -> "group-88",
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "latest",
      ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> Boolean.box(true)
    )

    val topics = Array("test")
    val kafkaDirectStream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
    )

    //Cache your RDD before you perform any heavyweight operations.
    kafkaDirectStream.start()

    val saveRdd = kafkaDirectStream.map(x => {
      // receive json from kafka
      val jsonObject = JSON.parse(x.value()).asInstanceOf[JSONObject]
      jsonObject
    }).map(json => {
      /*some json field operation*/
      // build one GenericRowWithSchema per JSON object, with its own schema
      val keySets = dataJson.keySet()
      val dataArray: ArrayBuffer[AnyRef] = ArrayBuffer[AnyRef]()
      val fieldArray: ArrayBuffer[StructField] = ArrayBuffer[StructField]()
      keySets.forEach(dataKey => {
        fieldArray.append(
          StructField(dataKey, SqlTypeConverter.toSparkType(sqlTypeJson.getIntValue(dataKey))))
        dataArray.append(dataJson.get(dataKey))
      })
      val schema = StructType(fieldArray)
      val row = new GenericRowWithSchema(dataArray.toArray, schema).asInstanceOf[Row]
      row
    })

    saveRdd.foreachRDD(rdd => {
      // Get the offset ranges in the RDD
      //println(rdd.map(x => x.toJSONString()).toDebugString());
      sss.implicits
      rdd.collect().foreach(x => {
        println(x.json)
        println(x.schema.sql)
      })
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
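One way to get from there to a DataFrame, shown here only as a minimal sketch (this is not an answer posted in the thread, and it assumes all rows in a given micro-batch share one schema), is to take the schema carried by the first GenericRowWithSchema and call SparkSession.createDataFrame on the RDD[Row] inside the foreachRDD block above:

saveRdd.foreachRDD(rdd => {
  if (!rdd.isEmpty()) {
    // assumes every Row in this batch carries the same schema; mixed schemas
    // would need to be split into separate RDDs first
    val schema = rdd.first().schema
    val df = sss.createDataFrame(rdd, schema)
    df.show()
    // df.write.format("hudi")...   // continue with the Hudi write from here
  }
})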

overloaded method value createDirectStream with alternatives

My Spark version is 1.6.2 and my Kafka version is 0.10.1.0. I want to send a custom object as the Kafka value type, push this custom object into a Kafka topic, and read the data with Spark Streaming using the direct approach. The following is my code:
import com.xxxxx.kafka.{KafkaJsonDeserializer, KafkaObjectDecoder, pharmacyData}
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object sparkReadKafka {
  val sparkConf = new SparkConf().setAppName("SparkReadKafka")
  val sc = new SparkContext(sparkConf)
  val ssc = new StreamingContext(sc, Seconds(1))

  def main(args: Array[String]): Unit = {
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "kafka.kafka-cluster-shared.non-prod-5-az-scus.prod.us.xxxxx.net:9092",
      //"key.deserializer" -> classOf[StringDeserializer],
      //"value.deserializer" -> classOf[KafkaJsonDeserializer],
      "group.id" -> "consumer-group-2",
      "auto.offset.reset" -> "earliest",
      "auto.commit.interval.ms" -> "1000",
      "enable.auto.commit" -> (false: java.lang.Boolean),
      "session.timeout.ms" -> "30000"
    )

    val topic = "hw_insights"
    val stream = KafkaUtils.createDirectStream[String, pharmacyData, StringDecoder, KafkaObjectDecoder](ssc, kafkaParams, Set(topic))
  }
}
The error I got is similar to this (I had to remove some parts for security purposes):
Error:(29, 47) overloaded method value createDirectStream with alternatives:
(jssc: org.apache.spark.streaming.api.java.JavaStreamingContext,keyClass: Class[String],valueClass: Class[com.xxxxxxx.kafka.pharmacyData],keyDecoderClass: Class[kafka.serializer.StringDecoder],valueDecoderClass: Class[com.xxxxxxx.kafka.KafkaObjectDecoder],kafkaParams: java.util.Map[String,String],topics: java.util.Set[String])org.apache.spark.streaming.api.java.JavaPairInputDStream[String,com.xxxxxxx.kafka.pharmacyData]
(ssc: org.apache.spark.streaming.StreamingContext,kafkaParams: scala.collection.immutable.Map[String,String],topics: scala.collection.immutable.Set[String])(implicit evidence$19: scala.reflect.ClassTag[String], implicit evidence$20: scala.reflect.ClassTag[com.xxxxxxx.kafka.pharmacyData], implicit evidence$21: scala.reflect.ClassTag[kafka.serializer.StringDecoder], implicit evidence$22: scala.reflect.ClassTag[com.xxxxxxx.kafka.KafkaObjectDecoder])org.apache.spark.streaming.dstream.InputDStream[(String, com.xxxxxxx.kafka.pharmacyData)]
cannot be applied to (org.apache.spark.streaming.StreamingContext, scala.collection.immutable.Map[String,Object], scala.collection.immutable.Set[String])
val stream = KafkaUtils.createDirectStream[String, pharmacyData, StringDecoder, KafkaObjectDecoder](ssc, kafkaParams, Set(topic))
And below is my custom decoder class:
import kafka.serializer.Decoder
import org.codehaus.jackson.map.ObjectMapper

class KafkaObjectDecoder extends Decoder[pharmacyData] {
  override def fromBytes(bytes: Array[Byte]): pharmacyData = {
    val mapper = new ObjectMapper()
    val pdata = mapper.readValue(bytes, classOf[pharmacyData])
    pdata
  }
}
Can someone please help me with these issues? Thank you!
The error is saying that your parameters are incorrect:
cannot be applied to (org.apache.spark.streaming.StreamingContext, scala.collection.immutable.Map[String,Object], scala.collection.immutable.Set[String])
The closest matching overload it thinks you want is:
(jssc: org.apache.spark.streaming.api.java.JavaStreamingContext,keyClass: Class[String],valueClass: Class[com.xxxxxxx.kafka.pharmacyData],keyDecoderClass: Class[kafka.serializer.StringDecoder],valueDecoderClass: Class[com.xxxxxxx.kafka.KafkaObjectDecoder],kafkaParams: java.util.Map[String,String],topics: java.util.Set[String])
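Looking at the Scala alternative in the error instead, it expects kafkaParams as Map[String, String] and topics as Set[String], whereas the code passes Map[String, Object] (the java.lang.Boolean value makes the map's value type Object). As a sketch only, not a verified fix, one way to satisfy that overload while keeping the Kafka 0.8-style API the imports point at (the broker host is copied from the question):

val kafkaParams = Map[String, String](
  "bootstrap.servers" -> "kafka.kafka-cluster-shared.non-prod-5-az-scus.prod.us.xxxxx.net:9092",
  "group.id" -> "consumer-group-2",
  "auto.offset.reset" -> "smallest",   // 0.8-era equivalent of "earliest"
  "auto.commit.interval.ms" -> "1000",
  "enable.auto.commit" -> "false",     // plain String instead of java.lang.Boolean
  "session.timeout.ms" -> "30000"
)

val stream = KafkaUtils.createDirectStream[String, pharmacyData, StringDecoder, KafkaObjectDecoder](
  ssc, kafkaParams, Set(topic))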

error: value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.kafka.clients.consumer.ConsumerRecord[String,String]]

I am trying to capture Kafka events (which I am getting in serialised form) using Spark Streaming in Scala.
Here is my code-snippet:
val spark = SparkSession.builder().master("local[*]").appName("Spark-Kafka-Integration").getOrCreate()
spark.conf.set("spark.driver.allowMultipleContexts", "true")
val sc = spark.sparkContext
val ssc = new StreamingContext(sc, Seconds(5))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val topics = Set("<topic-name>")
val brokers = "<some-list>"
val groupId = "spark-streaming-test"

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> brokers,
  "auto.offset.reset" -> "earliest",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "group.id" -> groupId,
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val messages: InputDStream[ConsumerRecord[String, String]] =
  KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
  )

messages.foreachRDD { rdd =>
  println(rdd.toDF())
}

ssc.start()
ssc.awaitTermination()
I am getting this error message:
Error:(59, 19) value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.kafka.clients.consumer.ConsumerRecord[String,String]] println(rdd.toDF())
toDF comes through DatasetHolder
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLImplicits
I haven't replicated it, but my guess is that there is no Encoder for ConsumerRecord[String, String], so you can either provide one or first map the records to something for which an Encoder can be derived (a case class or a primitive).
Also, println within foreachRDD will probably not act the way you want, due to the distributed nature of Spark.
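As an illustrative sketch only (not taken from the original answer): mapping each ConsumerRecord to its String value first gives a type Spark already has an Encoder for, so toDF then works inside foreachRDD. This reuses the messages and sqlContext values from the question's code:

messages.foreachRDD { rdd =>
  import sqlContext.implicits._                 // brings toDF into scope via DatasetHolder
  val df = rdd.map(record => record.value()).toDF("value")
  df.show()                                     // runs on the driver, unlike println inside rdd.foreach
}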

KafkaConsumer is not safe for multi-threaded access Spark Streaming Scala

I'm trying to join two different streams coming from Apache Kafka (two different topics) in Apache Spark Streaming on a cluster of machines.
The messages I send are strings "formatted" as CSV (comma separated).
This is the Spark code:
// Create the context with a 3 second batch size
val sparkConf = new SparkConf().setAppName("SparkScript").set("spark.driver.allowMultipleContexts", "true").set("spark.streaming.concurrentJobs", "3").setMaster("spark://0.0.0.0:7077")
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(3))

case class Location(latitude: Double, longitude: Double, name: String)
case class Datas1(location: Location, timestamp: String, measurement: Double, unit: String, accuracy: Double, elem: String, elems: String, elemss: String)
case class Sensors1(sensor_name: String, start_date: String, end_date: String, data1: Datas1)

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "0.0.0.0:9092",
  "key.deserializer" -> classOf[StringDeserializer].getCanonicalName,
  "value.deserializer" -> classOf[StringDeserializer].getCanonicalName,
  "group.id" -> "test_luca",
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val topics1 = Array("topics1")
val stream1 = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics1, kafkaParams))
val s1pre = stream1.map(record => record.value.split(",").map(_.trim))
val s1 = s1pre.map(x => Sensors1(x.apply(6), "2016-03-01T00:00:00.000", "2018-09-01T00:00:00.000",
  Datas1(Location(x.apply(1).toDouble, x.apply(2).toDouble, ""), x.apply(0), x.apply(3).toDouble, x.apply(5), x.apply(4).toDouble, x.apply(7), x.apply(8), x.apply(9))))

val topics2 = Array("topics2")
val stream2 = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics2, kafkaParams))
val s2pre = stream2.map(record => record.value.split(",").map(_.trim))
val s2 = s2pre.map(x => Sensors1(x.apply(6), "2016-03-01T00:00:00.000", "2018-09-01T00:00:00.000",
  Datas1(Location(x.apply(1).toDouble, x.apply(2).toDouble, ""), x.apply(0), x.apply(3).toDouble, x.apply(5), x.apply(4).toDouble, x.apply(7), x.apply(8), x.apply(9))))

val j1s1 = s1.map(x => (x.data1.timestamp, x))
val j1s2 = s2.map(x => (x.data1.timestamp, x))
val j1s1win = j1s1.window(Seconds(3), Seconds(6))
val j1s2win = j1s2.window(Seconds(3), Seconds(6))
val j1pre = j1s1win.join(j1s2win)

case class Sensorj1(sensor_name: String, start_date: String, end_date: String)
val j1 = j1pre.map { r => new Sensorj1("j1", r._2._1.start_date, r._2._1.end_date) }
j1.print()
The problem I have is "KafkaConsumer is not safe for multi-threaded access".
After reading different posts I changed my code by adding cache() at the end of the Kafka streams (val stream1 and val stream2), as sketched below.
After that I no longer get the same error, but I get a serialization error on the strings I try to map.
I don't understand it and have no idea how to fix this problem.
Any suggestions?
Thanks
LF
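For reference, a minimal sketch of the change described above and of a common workaround for the follow-up serialization error. This is an assumption about what the code looked like, not something posted in the thread: ConsumerRecord is not serializable, so caching the raw direct stream forces Spark to serialize it, while caching after mapping each record to its value keeps only serializable data.

// caching the raw stream triggers serialization of ConsumerRecord:
//   val stream1 = KafkaUtils.createDirectStream[String, String](...).cache()
// caching after extracting the String value avoids that:
val s1pre = stream1.map(record => record.value.split(",").map(_.trim)).cache()
val s2pre = stream2.map(record => record.value.split(",").map(_.trim)).cache()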

Receiving empty data from Kafka - Spark Streaming

Why am I getting empty messages when I read a topic from Kafka?
Is it a problem with the decoder?
*There is no error or exception.
Code:
def main(args: Array[String]) {
  val sparkConf = new SparkConf().setAppName("Queue Status")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  ssc.checkpoint("/tmp/")

  val kafkaConfig = Map("zookeeper.connect" -> "ip.internal:2181",
                        "group.id" -> "queue-status")
  val kafkaTopics = Map("queue_status" -> 1)

  val kafkaStream = KafkaUtils.createStream[String, QueueStatusMessage, StringDecoder, QueueStatusMessageKafkaDeserializer](
    ssc,
    kafkaConfig,
    kafkaTopics,
    StorageLevel.MEMORY_AND_DISK)

  kafkaStream.window(Minutes(1), Seconds(10)).print()

  ssc.start()
  ssc.awaitTermination()
}
The Kafka decoder:
class QueueStatusMessageKafkaDeserializer(props: VerifiableProperties = null) extends Decoder[QueueStatusMessage] {
  override def fromBytes(bytes: Array[Byte]): QueueStatusMessage = QueueStatusMessage.parseFrom(bytes)
}
The (empty) result:
-------------------------------------------
Time: 1440010266000 ms
-------------------------------------------
(null,QueueStatusMessage(,,0,None,None))
(null,QueueStatusMessage(,,0,None,None))
(null,QueueStatusMessage(,,0,None,None))
(null,QueueStatusMessage(,,0,None,None))
Solution:
I just explicitly specified the types in the Kafka topics Map:
val kafkaTopics = Map[String, Int]("queue_status" -> 1)
I still don't know the reason for the problem, but the code is working fine now.