KafkaConsumer is not safe for multi-threaded access - Spark Streaming (Scala)

I'm trying to join two different streams coming from Apache Kafka (two different topics) in Apache Spark Streaming on a cluster of machines.
The messages I send are strings formatted as CSV (comma separated).
This is the Spark code:
// Create the context with a 3 second batch size
val sparkConf = new SparkConf().setAppName("SparkScript").set("spark.driver.allowMultipleContexts", "true").set("spark.streaming.concurrentJobs", "3").setMaster("spark://0.0.0.0:7077")
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(3))
case class Location(latitude: Double, longitude: Double, name: String)
case class Datas1(location : Location, timestamp : String, measurement : Double, unit: String, accuracy : Double, elem: String, elems: String, elemss: String)
case class Sensors1(sensor_name: String, start_date: String, end_date: String, data1: Datas1)
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "0.0.0.0:9092",
"key.deserializer" -> classOf[StringDeserializer].getCanonicalName,
"value.deserializer" -> classOf[StringDeserializer].getCanonicalName,
"group.id" -> "test_luca",
"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics1 = Array("topics1")
val stream1 = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics1, kafkaParams))
val s1pre = stream1.map(record => record.value.split(",").map(_.trim))
val s1 = s1pre.map(x => Sensors1(x.apply(6), "2016-03-01T00:00:00.000", "2018-09-01T00:00:00.000", Datas1(Location(x.apply(1).toDouble,x.apply(2).toDouble, ""), x.apply(0) ,x.apply(3).toDouble,x.apply(5),x.apply(4).toDouble,x.apply(7),x.apply(8),x.apply(9))))
val topics2 = Array("topics2")
val stream2 = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics2, kafkaParams))
val s2pre = stream2.map(record => record.value.split(",").map(_.trim))
val s2 = s2pre.map(x => Sensors1(x.apply(6), "2016-03-01T00:00:00.000", "2018-09-01T00:00:00.000", Datas1(Location(x.apply(1).toDouble,x.apply(2).toDouble, ""), x.apply(0) ,x.apply(3).toDouble,x.apply(5),x.apply(4).toDouble,x.apply(7),x.apply(8),x.apply(9))))
val j1s1 = s1.map(x => (x.data1.timestamp, (x)))
val j1s2 = s2.map(x => (x.data1.timestamp, (x)))
val j1s1win = j1s1.window(Seconds(3), Seconds(6))
val j1s2win = j1s2.window(Seconds(3), Seconds(6))
val j1pre = j1s1win.join(j1s2win)
case class Sensorj1(sensor_name: String, start_date: String, end_date: String)
val j1 = j1pre.map { r => new Sensorj1("j1", r._2._1.start_date, r._2._1.end_date)}
j1.print()
The problem I have is "KafkaConsumer is not safe for multi-threaded access".
After reading different posts I changed my code by adding cache() at the end of the Kafka streams (val stream1 and val stream2); see the sketch below.
After that I no longer get that error, but I get a serialization error on the string I try to map.
I do not understand this and have no idea how to fix the problem.
Any suggestions?
Thanks
LF
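For reference, the cache() change mentioned in the question can be sketched as follows; caching the already-mapped DStreams (s1pre/s2pre) rather than the raw streams likely also avoids the follow-up serialization error, because ConsumerRecord itself is not serializable:
// Sketch of the change described above: cache the mapped DStreams so the windowed
// join does not hit the same cached KafkaConsumer from concurrent tasks.
// Caching after the map to plain Strings avoids serializing ConsumerRecord objects.
val s1pre = stream1.map(record => record.value.split(",").map(_.trim)).cache()
val s2pre = stream2.map(record => record.value.split(",").map(_.trim)).cache()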

Related

Convert Spark RDD[Row] to DataFrame

I am having trouble transforming JSON into a DataFrame.
I am trying to use Spark in a project that synchronizes tables into a data lake (Hudi) from a CDC tool (Canal) listening to the MySQL binlog. I receive JSON describing row changes and add some fields to it. This JSON stream includes multiple schemas; each schema has different columns and may add new columns in the future, so I build a GenericRowWithSchema for each JSON object and pass an individual schema with each row.
Now I need to transform the RDD[Row] into a DataFrame in order to write it to Hudi. How can I do that?
object code{
def main(args: Array[String]): Unit = {
val sss = SparkSession.builder().appName("SparkHudi").getOrCreate()
//val sc = SparkContext.getOrCreate
val sc = sss.sparkContext
val ssc = new StreamingContext(sc, Seconds(1))
//ssc.sparkContext.setLogLevel("INFO");
import org.apache.kafka.common.serialization.StringDeserializer
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "kafka.test.com:9092",
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.GROUP_ID_CONFIG -> "group-88",
ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "latest",
ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> Boolean.box(true)
)
val topics = Array("test")
val kafkaDirectStream = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
)
//Cache your RDD before you perform any heavyweight operations.
kafkaDirectStream.cache()
val saveRdd = kafkaDirectStream.map(x => {
//receive json from kafka
val jsonObject = JSON.parse(x.value()).asInstanceOf[JSONObject]
jsonObject
}).map(json => {
/*some json field operation*/
val keySets = dataJson.keySet()
val dataArray:ArrayBuffer[AnyRef] = ArrayBuffer[AnyRef]()
val fieldArray:ArrayBuffer[StructField] = ArrayBuffer[StructField]()
keySets.forEach(dataKey=>{
fieldArray.append(
StructField(dataKey,SqlTypeConverter.toSparkType(sqlTypeJson.getIntValue(dataKey))))
dataArray.append(dataJson.get(dataKey));
})
val schema = StructType(fieldArray)
val row = new GenericRowWithSchema(dataArray.toArray, schema).asInstanceOf[Row]
row
})
saveRdd.foreachRDD ( rdd => {
// Get the offset ranges in the RDD
//println(rdd.map(x => x.toJSONString()).toDebugString());
import sss.implicits._
rdd.collect().foreach(x=>{
println(x.json)
println(x.schema.sql)
})
})
ssc.start()
ssc.awaitTermination()
}
}
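One possible approach for the conversion itself (a sketch only, untested against the code above; it assumes all rows in a given micro-batch share the same schema, which is not guaranteed here since the stream mixes schemas):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

saveRdd.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // Take the schema carried by the first GenericRowWithSchema; with mixed schemas
    // you would first group/filter rows by schema and build one DataFrame per schema.
    val schema: StructType = rdd.first().schema
    val df = sss.createDataFrame(rdd, schema)
    df.show()
    // ...then write df to Hudi as needed
  }
}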

How to turn this simple Spark Streaming code into a multi-threaded one?

I am learning Kafka in Scala. The attached code is just a word count implementation using Kafka and Spark Streaming.
How do I have a separate consumer execution per partition whilst streaming? Please help!
Here is my code:
class ConsumerM(topics: String, bootstrap_server: String, group_name: String) {
Logger.getLogger("org").setLevel(Level.ERROR)
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
.setMaster("local[*]")
.set("spark.executor.memory","1g")
val ssc = new StreamingContext(sparkConf, Seconds(1))
val topicsSet = topics.split(",")
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> bootstrap_server,
ConsumerConfig.GROUP_ID_CONFIG -> group_name,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
"auto.offset.reset" ->"earliest")
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
Assuming your input topic has multiple partitions, setting local[*] additionally means you'll have one worker thread per CPU core, so at least one partition can be consumed by each thread in parallel.
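For example, you can check the per-batch parallelism like this (a sketch reusing the messages stream from the code above); the direct stream creates one Spark partition per Kafka partition:
// Each Kafka partition becomes one Spark partition, so this count is the number
// of tasks that can run concurrently per batch under local[*].
messages.foreachRDD { rdd =>
  println(s"Partitions in this batch: ${rdd.getNumPartitions}")
}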

error: value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.kafka.clients.consumer.ConsumerRecord[String,String]]

I am trying to capture Kafka events (which I am getting in serialised form) using Spark Streaming in Scala.
Here is my code-snippet:
val spark = SparkSession.builder().master("local[*]").appName("Spark-Kafka-Integration").getOrCreate()
spark.conf.set("spark.driver.allowMultipleContexts", "true")
val sc = spark.sparkContext
val ssc = new StreamingContext(sc, Seconds(5))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val topics=Set("<topic-name>")
val brokers="<some-list>"
val groupId="spark-streaming-test"
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> brokers,
"auto.offset.reset" -> "earliest",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
"group.id" -> groupId,
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val messages: InputDStream[ConsumerRecord[String, String]] =
KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
)
messages.foreachRDD { rdd =>
println(rdd.toDF())
}
ssc.start()
ssc.awaitTermination()
I am getting error message as:
Error:(59, 19) value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.kafka.clients.consumer.ConsumerRecord[String,String]] println(rdd.toDF())
toDF comes through DatasetHolder:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLImplicits
I haven't replicated it, but my guess is that there's no Encoder for ConsumerRecord[String, String], so you can either provide one or first map the records to something for which an Encoder can be derived (a case class or a primitive).
Also, println within foreachRDD will probably not act the way you want due to the distributed nature of Spark.
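A minimal sketch of that mapping (not the original code; it maps each ConsumerRecord to its String value, for which an Encoder exists):
messages.foreachRDD { rdd =>
  import sqlContext.implicits._            // brings toDF into scope for RDDs of encodable types
  val df = rdd.map(_.value).toDF("value")  // String has a built-in Encoder
  df.show()                                // show() runs a job and prints on the driver
}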

Writing data to Cassandra using Spark

I have a Spark job written in Scala in which I am just trying to write one comma-separated line, coming from a Kafka producer, to a Cassandra database, but I couldn't call saveToCassandra.
I saw a few word-count examples where a map structure is written to a Cassandra table with two columns, and that seems to work fine. But I have many columns, and I found that the data structure needs to be parallelized.
Here's is the sample of my code:
object TestPushToCassandra extends SparkStreamingJob {
def validate(ssc: StreamingContext, config: Config): SparkJobValidation = SparkJobValid
def runJob(ssc: StreamingContext, config: Config): Any = {
val bp_conf=BpHooksUtils.getSparkConf()
val brokers=bp_conf.get("bp_kafka_brokers","unknown_default")
val input_topics = config.getString("topics.in").split(",").toSet
val output_topic = config.getString("topic.out")
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, input_topics)
val lines = messages.map(_._2)
val words = lines.flatMap(_.split(","))
val li = words.par
li.saveToCassandra("testspark","table1", SomeColumns("col1","col2","col3"))
li.print()
words.foreachRDD(rdd =>
rdd.foreachPartition(partition =>
partition.foreach{
case x:String=>{
val props = new HashMap[String, Object]()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
val outMsg=x+" from spark"
val producer = new KafkaProducer[String,String](props)
val message=new ProducerRecord[String, String](output_topic,null,outMsg)
producer.send(message)
}
}
)
)
ssc.start()
ssc.awaitTermination()
}
}
I think it's the syntax of Scala that I am not getting correct.
Thanks in advance.
You need to change your words DStream into something that the Connector can handle.
Like a Tuple
val words = lines
.map(_.split(","))
.map( wordArr => (wordArr(0), wordArr(1), wordArr(2)))
or a Case Class
case class YourRow(col1: String, col2: String, col3: String)
val words = lines
.map(_.split(","))
.map( wordArr => YourRow(wordArr(0), wordArr(1), wordArr(2)))
or a CassandraRow
This is because if you pass in the Array by itself, it could be interpreted as a single Array column you are trying to insert into C*, rather than as 3 separate columns.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md
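Put together, a sketch of the save using the case-class version (assuming the testspark.table1 table from the question and the connector's streaming import, which adds saveToCassandra to DStreams):
import com.datastax.spark.connector._            // SomeColumns
import com.datastax.spark.connector.streaming._  // saveToCassandra on DStreams

case class YourRow(col1: String, col2: String, col3: String)

val rows = lines
  .map(_.split(","))
  .map(wordArr => YourRow(wordArr(0), wordArr(1), wordArr(2)))

// Writes every micro-batch to testspark.table1
rows.saveToCassandra("testspark", "table1", SomeColumns("col1", "col2", "col3"))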

Receiving empty data from Kafka - Spark Streaming

Why am I getting empty data messages when I read a topic from Kafka?
Is it a problem with the Decoder?
(There is no error or exception.)
Code:
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Queue Status")
val ssc = new StreamingContext(sparkConf, Seconds(1))
ssc.checkpoint("/tmp/")
val kafkaConfig = Map("zookeeper.connect" -> "ip.internal:2181",
"group.id" -> "queue-status")
val kafkaTopics = Map("queue_status" -> 1)
val kafkaStream = KafkaUtils.createStream[String, QueueStatusMessage, StringDecoder, QueueStatusMessageKafkaDeserializer](
ssc,
kafkaConfig,
kafkaTopics,
StorageLevel.MEMORY_AND_DISK)
kafkaStream.window(Minutes(1),Seconds(10)).print()
ssc.start()
ssc.awaitTermination()
}
The Kafka decoder:
class QueueStatusMessageKafkaDeserializer(props: VerifiableProperties = null) extends Decoder[QueueStatusMessage] {
override def fromBytes(bytes: Array[Byte]): QueueStatusMessage = QueueStatusMessage.parseFrom(bytes)
}
The (empty) result:
-------------------------------------------
Time: 1440010266000 ms
-------------------------------------------
(null,QueueStatusMessage(,,0,None,None))
(null,QueueStatusMessage(,,0,None,None))
(null,QueueStatusMessage(,,0,None,None))
(null,QueueStatusMessage(,,0,None,None))
Solution:
I just strictly specified the types in the Kafka topic Map:
val kafkaTopics = Map[String, Int]("queue_status" -> 1)
I still don't know the reason for the problem, but the code is working fine now.
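For reference, the corrected stream creation from the code above then reads (a sketch; QueueStatusMessage and QueueStatusMessageKafkaDeserializer are the classes defined in the question):
// Explicitly typed topic map: topic name -> number of consumer threads
val kafkaTopics = Map[String, Int]("queue_status" -> 1)

val kafkaStream = KafkaUtils.createStream[String, QueueStatusMessage, StringDecoder, QueueStatusMessageKafkaDeserializer](
  ssc,
  kafkaConfig,
  kafkaTopics,
  StorageLevel.MEMORY_AND_DISK)

kafkaStream.window(Minutes(1), Seconds(10)).print()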