overloaded method value createDirectStream with alternatives - scala

My Spark version is 1.6.2 and my Kafka version is 0.10.1.0. I want to send a custom object as the Kafka value type, push this custom object into a Kafka topic, and then read the data with Spark Streaming using the direct approach. The following is my code:
import com.xxxxx.kafka.{KafkaJsonDeserializer, KafkaObjectDecoder, pharmacyData}
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object sparkReadKafka {
  val sparkConf = new SparkConf().setAppName("SparkReadKafka")
  val sc = new SparkContext(sparkConf)
  val ssc = new StreamingContext(sc, Seconds(1))

  def main(args: Array[String]): Unit = {
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "kafka.kafka-cluster-shared.non-prod-5-az-scus.prod.us.xxxxx.net:9092",
      //"key.deserializer" -> classOf[StringDeserializer],
      //"value.deserializer" -> classOf[KafkaJsonDeserializer],
      "group.id" -> "consumer-group-2",
      "auto.offset.reset" -> "earliest",
      "auto.commit.interval.ms" -> "1000",
      "enable.auto.commit" -> (false: java.lang.Boolean),
      "session.timeout.ms" -> "30000"
    )

    val topic = "hw_insights"

    val stream = KafkaUtils.createDirectStream[String, pharmacyData, StringDecoder, KafkaObjectDecoder](ssc, kafkaParams, Set(topic))
  }
}
And the error I got is similar to this (I had to remove some parts for security purposes):
Error:(29, 47) overloaded method value createDirectStream with alternatives:
(jssc: org.apache.spark.streaming.api.java.JavaStreamingContext,keyClass: Class[String],valueClass: Class[com.xxxxxxx.kafka.pharmacyData],keyDecoderClass: Class[kafka.serializer.StringDecoder],valueDecoderClass: Class[com.xxxxxxx.kafka.KafkaObjectDecoder],kafkaParams: java.util.Map[String,String],topics: java.util.Set[String])org.apache.spark.streaming.api.java.JavaPairInputDStream[String,com.xxxxxxx.kafka.pharmacyData]
(ssc: org.apache.spark.streaming.StreamingContext,kafkaParams: scala.collection.immutable.Map[String,String],topics: scala.collection.immutable.Set[String])(implicit evidence$19: scala.reflect.ClassTag[String], implicit evidence$20: scala.reflect.ClassTag[com.xxxxxxx.kafka.pharmacyData], implicit evidence$21: scala.reflect.ClassTag[kafka.serializer.StringDecoder], implicit evidence$22: scala.reflect.ClassTag[com.xxxxxxx.kafka.KafkaObjectDecoder])org.apache.spark.streaming.dstream.InputDStream[(String, com.xxxxxxx.kafka.pharmacyData)]
cannot be applied to (org.apache.spark.streaming.StreamingContext, scala.collection.immutable.Map[String,Object], scala.collection.immutable.Set[String])
val stream = KafkaUtils.createDirectStream[String, pharmacyData, StringDecoder, KafkaObjectDecoder](ssc, kafkaParams, Set(topic))
And below is my custom decoder class:
import kafka.serializer.Decoder
import org.codehaus.jackson.map.ObjectMapper

class KafkaObjectDecoder extends Decoder[pharmacyData] {
  override def fromBytes(bytes: Array[Byte]): pharmacyData = {
    val mapper = new ObjectMapper()
    val pdata = mapper.readValue(bytes, classOf[pharmacyData])
    pdata
  }
}
Can someone please help me with these issues? Thank you!

The error is saying your parameters are incorrect:
cannot be applied to (org.apache.spark.streaming.StreamingContext, scala.collection.immutable.Map[String,Object], scala.collection.immutable.Set[String])
The closest method it thinks you want is:
(jssc: org.apache.spark.streaming.api.java.JavaStreamingContext,keyClass: Class[String],valueClass: Class[com.xxxxxxx.kafka.pharmacyData],keyDecoderClass: Class[kafka.serializer.StringDecoder],valueDecoderClass: Class[com.xxxxxxx.kafka.KafkaObjectDecoder],kafkaParams: java.util.Map[String,String],topics: java.util.Set[String])
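Comparing the alternatives with the last line of the error: the Scala overload expects kafkaParams as a Map[String, String], while the code passes a Map[String, Object] (declared that way so the map can hold the java.lang.Boolean), so neither overload applies. A minimal sketch that type-checks against the Scala overload, keeping the asker's decoders and topic, is below; note that values such as enable.auto.commit then have to be written as plain strings, and some of the new-consumer settings from the original map are not meaningful to the 0.8-style direct API shipped with Spark 1.6:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// All values must be Strings to match the Scala overload:
// createDirectStream[K, V, KD, VD](ssc, kafkaParams: Map[String, String], topics: Set[String])
val kafkaParams = Map[String, String](
  "bootstrap.servers" -> "kafka.kafka-cluster-shared.non-prod-5-az-scus.prod.us.xxxxx.net:9092",
  "group.id" -> "consumer-group-2",
  "enable.auto.commit" -> "false"
)

val topic = "hw_insights"

val stream = KafkaUtils.createDirectStream[String, pharmacyData, StringDecoder, KafkaObjectDecoder](
  ssc, kafkaParams, Set(topic))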

Related

Reading from Kafka with Scala Spark2 Streaming

I need to connect to Kafka and read data from it (after that I have to write it to an Elasticsearch database), but for now I just want to read and print the data.
I am a newbie with both Kafka and Scala, and after reading around on the internet I have written this:
// Spark
import org.apache.spark._
import org.apache.spark.streaming._
// Kafka
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object Main extends App {
  val master = "local[2]"
  val hostname = ""

  val conf = new SparkConf().setAppName("KafkaConnection").setMaster(master)
  val sc = SparkContext.getOrCreate(conf)
  val ssc = new StreamingContext(sc, Seconds(1))

  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "localhost:9092,anotherhost:9092",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "IRC",
    "auto.offset.reset" -> "latest",
    "enable.auto.commit" -> (false: java.lang.Boolean)
  )

  val topics = Array("topicA", "topicB")
  val stream = KafkaUtils.createDirectStream[String, String](
    ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

  stream.map(record => (record.key, record.value))

  val offsetRanges = Array(
    // topic, partition, inclusive starting offset, exclusive ending offset
    OffsetRange("test", 0, 0, 100),
    OffsetRange("test", 1, 0, 100)
  )

  val rdd = KafkaUtils.createRDD[String, String](
    ssc, kafkaParams, offsetRanges, PreferConsistent)
}
But I don't know how to continue. What do I need now? Also, do you know any public Kafka broker/topic which I can use to read from?
Thank you in advance!
What do I need now?
Try running the code, either with spark-submit or by running the main method.
do you know any public Kafka Broker/topic which I can use to read from it?
That would be insecure, so no. Start your own broker locally by following the official Kafka quickstart guide.
Your code currently reads from a topic called test.
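To actually see the data, the stream needs an output action and the streaming context has to be started. A minimal sketch of how the posted code could continue, printing the first records of every batch (the createRDD/OffsetRange part can be dropped until it is needed):

// Print the (key, value) pairs of each micro-batch to the console.
val pairs = stream.map(record => (record.key, record.value))
pairs.print()

// Nothing happens until the streaming context is started.
ssc.start()
ssc.awaitTermination()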

error: value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.kafka.clients.consumer.ConsumerRecord[String,String]]

I am trying to capture Kafka events (which I receive in serialised form) using Spark Streaming in Scala.
Here is my code-snippet:
val spark = SparkSession.builder().master("local[*]").appName("Spark-Kafka-Integration").getOrCreate()
spark.conf.set("spark.driver.allowMultipleContexts", "true")
val sc = spark.sparkContext
val ssc = new StreamingContext(sc, Seconds(5))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val topics = Set("<topic-name>")
val brokers = "<some-list>"
val groupId = "spark-streaming-test"

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> brokers,
  "auto.offset.reset" -> "earliest",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "group.id" -> groupId,
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val messages: InputDStream[ConsumerRecord[String, String]] =
  KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
  )

messages.foreachRDD { rdd =>
  println(rdd.toDF())
}

ssc.start()
ssc.awaitTermination()
I am getting error message as:
Error:(59, 19) value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.kafka.clients.consumer.ConsumerRecord[String,String]] println(rdd.toDF())
toDF comes through DatasetHolder
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLImplicits
I haven't replicated it, but my guess is that there's no Encoder for ConsumerRecord[String, String], so you can either provide one or first map it to something for which an Encoder can be derived (a case class or a primitive).
Also, println within foreachRDD will probably not act the way you want, due to the distributed nature of Spark.
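A minimal sketch of the mapping option, turning each record into a (key, value) tuple (for which Spark can derive an Encoder) before calling toDF, assuming the import sqlContext.implicits._ from the question is in scope:

messages.foreachRDD { rdd =>
  // ConsumerRecord has no Encoder, so extract plain fields first.
  val df = rdd.map(record => (record.key, record.value)).toDF("key", "value")
  df.show()  // show() prints actual rows; println(df) would only print the schema
}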

NotSerializableException with Neo4j Spark Streaming Scala

I am trying to run a query in Neo4j using the Neo4j-Spark connector. I want to pass values from the stream (produced by Kafka as a String) into my query. However, I get a serialization exception:
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext#54688d9f)
- field (class: consumer.SparkConsumer$$anonfun$processingLogic$2, name: sc$1, type: class org.apache.spark.SparkContext)
- object (class consumer.SparkConsumer$$anonfun$processingLogic$2, <function1>)
- field (class: consumer.SparkConsumer$$anonfun$processingLogic$2$$anonfun$apply$3, name: $outer, type: class consumer.SparkConsumer$$anonfun$processingLogic$2)
- object (class consumer.SparkConsumer$$anonfun$processingLogic$2$$anonfun$apply$3, <function1>)
Here is the code for the main function and querying logic:
object SparkConsumer {
  def main(args: Array[String]) {
    val config = "neo4j_local"
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("KafkaSparkStreaming")
    setNeo4jSparkConfig(config, sparkConf)

    val sparkSession = SparkSession
      .builder()
      .config(sparkConf)
      .getOrCreate()

    val streamingContext = new StreamingContext(sparkSession.sparkContext, Seconds(3))
    streamingContext.sparkContext.setLogLevel("ERROR")
    val sqlContext = new SQLContext(streamingContext.sparkContext)

    val numStreams = 2
    val topics = Array("member_topic1")

    def kafkaParams(i: Int) = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "group2",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val lines = (1 to numStreams).map(i => KafkaUtils.createDirectStream[String, String](
      streamingContext,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams(i))
    ))

    val messages = streamingContext.union(lines)
    val wordsArrays = values.map(_.split(","))

    wordsArrays.foreachRDD(rdd => rdd.foreach(
      data => execNeo4jSearchQuery(data)(streamingContext.sparkContext)
    ))

    streamingContext.start()
    streamingContext.awaitTermination()
  }

  def execNeo4jSearchQuery(data: Array[String])(implicit sc: SparkContext) = {
    val neo = Neo4j(sc)
    val query = "my query"
    val paramsMap = Map("lat" -> data(1).toDouble, "lon" -> data(2).toDouble, "id" -> data(0).toInt)
    val df = neo.cypher(query, paramsMap).loadDataFrame("group_name" -> "string", "event_name" -> "string", "venue_name" -> "string", "distance" -> "double")
    println("\ndf:")
    df.show()
  }
}
It is not allowed to access SparkContext or SparkSession, or to create distributed data structures, from an executor. Therefore:
wordsArrays.foreachRDD(rdd => rdd.foreach(
data => execNeo4jSearchQuery(data)(streamingContext.sparkContext)
))
where execNeo4jSearchQuery calls:
neo.cypher(query, paramsMap).loadDataFrame
is not valid Spark code.
If you want to access Neo4j directly from RDD.foreach you have to use a standard client (AnormCypher seems to provide a very elegant API), without conversion to Spark distributed structures.
A somewhat unrelated note: you might consider using a single connection for a whole set of records with foreachPartition (see also SPARK Cost of Initalizing Database Connection in map / mapPartitions context), as sketched below.
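As a minimal sketch of that pattern, assuming the official Neo4j Java driver (Bolt) is on the classpath (the package is org.neo4j.driver.v1 in 1.x driver versions) and using placeholder connection details and query, one driver and session is opened per partition on the executor instead of touching SparkContext there:

import org.neo4j.driver.{AuthTokens, GraphDatabase}

wordsArrays.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // Created on the executor, so nothing from the driver JVM has to be serialized.
    val driver = GraphDatabase.driver("bolt://localhost:7687", AuthTokens.basic("neo4j", "password"))
    val session = driver.session()
    try {
      partition.foreach { data =>
        // Same parameters as execNeo4jSearchQuery, but run through the plain client.
        val params = new java.util.HashMap[String, Object]()
        params.put("lat", Double.box(data(1).toDouble))
        params.put("lon", Double.box(data(2).toDouble))
        params.put("id", Int.box(data(0).toInt))
        session.run("my query", params)
      }
    } finally {
      session.close()
      driver.close()
    }
  }
}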

Wrong number of type parameters for overload function createDirectStream

I am new to Spark and Scala. While trying to run this simple code, which reads from a Kafka topic, I am bogged down by an error while creating the direct stream, suggesting I am providing the wrong number of type parameters to the overloaded function createDirectStream. Below is the line where I get the error:
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  streamingContext, kafkaParams, topicsSet)
And below is the full code:
package com.test.spark

import java.util.Properties
import org.apache.spark
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka010._
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object KafkaAirDRsProcess {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("AirDR Kafka to Spark")
    val sc = new SparkContext(sparkConf)
    val streamingContext = new StreamingContext(sc, Seconds(10))

    // Create direct kafka stream with brokers and topics
    val brokers = "10.21.165.145:6667"
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val topics = "AIRMAIN , dummy"
    val topicsSet = topics.split(",").toSet
    //val topicsSet = topics.map(_.toString).toSet

    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      streamingContext, kafkaParams, topicsSet)

    val LinesDStream = messages.map(_._2)
    val AirDRStream = LinesDStream.map(AirDRFilter.parseAirDR)

    AirDRStream.foreachRDD(foreachFunc = rdd => {
      System.out.println("--- New RDD with " + rdd.count() + " records");
      if (rdd.count() > 0) {
        rdd.toDF().registerTempTable("AirDRTemp")
        val FilteredCDR = sqlContext.sql("select * from AirDRTemp")
        println("======================print result =================")
        FilteredCDR.show()
      }
    });

    //streamingContext.checkpoint("/tmp/mytest/ckpt/")
    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
Below is a snapshot of the IntelliJ error.
Since you are using kafka-0-10, you can create the InputDStream as below:
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "10.21.165.145:6667",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "spark-consumer-group",  // required by the 0.10 consumer; the value here is a placeholder
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (true: java.lang.Boolean)
)

val topics = ???  // e.g. the topicsSet from the question
val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)
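To carry on with the DataFrame logic from the original main method, the values can then be pulled out of the ConsumerRecords. A rough sketch, assuming AirDRFilter.parseAirDR returns a case class (so toDF can derive a schema) and that an SQLContext with its implicits imported is in scope, as in the question:

val linesDStream = stream.map(_.value())   // with kafka-0-10, use record.value() instead of _._2
val airDRStream = linesDStream.map(AirDRFilter.parseAirDR)

airDRStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.toDF().registerTempTable("AirDRTemp")
    val filteredCDR = sqlContext.sql("select * from AirDRTemp")
    filteredCDR.show()
  }
}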
Hope this helps!

do not want string as type when using foreach in scala spark streaming?

Code snippet:
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
val write2hdfs = lines.filter(x => x._1 == "lineitem").map(_._2)

write2hdfs.foreachRDD(rdd => {
  rdd.foreach(avroRecord => {
    println(avroRecord)
    //val rawByte = avroRecord.getBytes("UTF-8")
Issue faced:
avroRecord holds Avro-encoded messages received from the Kafka stream.
By default avroRecord is a String when the above code is used, and a String uses UTF-16 encoding by default in Scala.
Because of this the deserialization is not correct and I am facing issues.
The messages were encoded into Avro with UTF-8 when sent to the Kafka stream.
I would need avroRecord to be raw bytes instead of getting it as a String and then converting it to bytes (internally the String would apply UTF-16 encoding), or a way to get avroRecord itself in UTF-8. I am stuck here at a dead end.
I need a way forward for this problem.
Thanks in advance.
UPDATE:
Code snippet changed:
val ssc = new StreamingContext(sparkConf, Seconds(5))
//val ssc = new JavaStreamingContext(sparkConf, Seconds(5))

val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val kafkaParams = Map[String, String](
  "zookeeper.connect" -> zkQuorum,
  "group.id" -> group,
  "zookeeper.connection.timeout.ms" -> "10000")

//val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
val lines = KafkaUtils.createStream[String, Message, StringDecoder, DefaultDecoder](
  ssc, kafkaParams, topics, StorageLevel.NONE)
Imports done:
import org.apache.spark.streaming._
import org.apache.spark.streaming.api.java.JavaStreamingContext
import org.apache.spark.streaming.kafka._
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream.toPairDStreamFunctions
import org.apache.avro
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord, GenericDatumWriter, GenericData}
import org.apache.avro.io.{DecoderFactory, DatumReader, DatumWriter, BinaryDecoder}
import org.apache.avro.file.{DataFileReader, DataFileWriter}
import java.io.{File, IOException}
//import java.io.*
import org.apache.commons.io.IOUtils
import _root_.kafka.serializer.{StringDecoder, DefaultDecoder}
import _root_.kafka.message.Message
import scala.reflect._
Compilation error:
Compiling 1 Scala source to /home/spark_scala/spark_stream_project/target/scala-2.10/classes...
[error] /home/spark_scala/spark_stream_project/src/main/scala/sparkStreaming.scala:34: overloaded method value createStream with alternatives:
[error] (jssc: org.apache.spark.streaming.api.java.JavaStreamingContext,keyTypeClass: Class[String],valueTypeClass: Class[kafka.message.Message],keyDecoderClass: Class[kafka.serializer.StringDecoder],valueDecoderClass: Class[kafka.serializer.DefaultDecoder],kafkaParams: java.util.Map[String,String],topics: java.util.Map[String,Integer],storageLevel: org.apache.spark.storage.StorageLevel)org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream[String,kafka.message.Message]
[error] (ssc: org.apache.spark.streaming.StreamingContext,kafkaParams: scala.collection.immutable.Map[String,String],topics: scala.collection.immutable.Map[String,Int],storageLevel: org.apache.spark.storage.StorageLevel)(implicit evidence$1: scala.reflect.ClassTag[String], implicit evidence$2: scala.reflect.ClassTag[kafka.message.Message], implicit evidence$3: scala.reflect.ClassTag[kafka.serializer.StringDecoder], implicit evidence$4: scala.reflect.ClassTag[kafka.serializer.DefaultDecoder])org.apache.spark.streaming.dstream.ReceiverInputDStream[(String, kafka.message.Message)]
[error] cannot be applied to (org.apache.spark.streaming.StreamingContext, scala.collection.immutable.Map[String,String], String, org.apache.spark.storage.StorageLevel)
[error] val lines = KafkaUtils.createStream[String,Message,StringDecoder,DefaultDecoder]
[error] ^
[error] one error found
What is wrong here?
Also, I don't see the correct constructor, as suggested, defined in the KafkaUtils API doc.
The API doc I am referring to:
https://spark.apache.org/docs/1.3.0/api/java/index.html?org/apache/spark/streaming/kafka/KafkaUtils.html
Looking forward to your support.
Thanks.
UPDATE 2:
Tried with the suggested corrections!
Code snippet:
val lines = KafkaUtils.createStream[String, Message, StringDecoder, DefaultDecoder](
  ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK_2)
val write2hdfs = lines.filter(x => x._1 == "lineitem").map(_._2)
Facing a runtime exception:
java.lang.ClassCastException: [B cannot be cast to kafka.message.Message
On this line:
KafkaUtils.createStream[String, Message, StringDecoder, DefaultDecoder](
  ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK_2)
val write2hdfs = lines.filter(x => x._1 == "lineitem").map(_._2)
Ideally, filtering this DStream[(String, Message)] should also work, right?
Do I need to extract the payload from Message before applying the map?
I need inputs please.
Thanks
You could do something like this:
import kafka.serializer.{StringDecoder, DefaultDecoder}
import kafka.message.Message

val kafkaParams = Map[String, String](
  "zookeeper.connect" -> zkQuorum,
  "group.id" -> group,
  "zookeeper.connection.timeout.ms" -> "10000")

// The third argument must be a Map[topic -> number of threads], e.g. the topicMap built above.
val lines = KafkaUtils.createStream[String, Message, StringDecoder, DefaultDecoder](
  ssc, kafkaParams, topicMap, storageLevel)
This should get you a DStream[(String, kafka.message.Message)], and you should be able to retrieve the raw bytes and convert to Avro from there.
This worked for me:
val lines = KafkaUtils.createStream[String, Array[Byte], StringDecoder, DefaultDecoder](
  ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK_2)
My requirement was to get the byte array, so I changed the value type to Array[Byte] instead of kafka.message.Message.
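With the value side arriving as Array[Byte], decoding into an Avro GenericRecord can then be done with the Avro classes already imported in the question. A minimal sketch, assuming the messages are raw binary-encoded Avro records (not Avro container files) and that the writer schema is available as schema:

import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

val write2hdfs = lines.filter(_._1 == "lineitem").map(_._2)

write2hdfs.foreachRDD { rdd =>
  rdd.foreach { bytes =>
    // Binary-decode the raw Avro payload; no String round-trip, so no UTF-16/UTF-8 mismatch.
    val reader = new GenericDatumReader[GenericRecord](schema)
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    val record: GenericRecord = reader.read(null, decoder)
    println(record)
  }
}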