Spark Streaming + Kafka to HDFS - Scala

When I try to consume messages from a Kafka topic using Spark Streaming, I get the error below:
scala> val kafkaStream = KafkaUtils.createStream(ssc, "<ipaddress>:2181","spark-streaming-consumer-group", Map("test1" -> 5))
Error:
`missing or invalid dependency detected while loading class file 'KafkaUtils.class'.
Could not access term kafka in package <root>,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'KafkaUtils.class' was compiled against an incompatible version of <root>.`
Scala version: 2.11.8
spark version: 2.1.0.2.6.0.3-8
I have tried all kinds of libraries for spark-streaming-kafka, but nothing worked.
I am executing the code from the Spark shell:
./spark-shell --jars /data/home/local/504/spark-streaming-kafka_2.10-1.5.1.jar, /data/home/local/504/spark-streaming_2.10-1.5.1.jar
Code
import org.apache.spark.SparkConf
val conf = new SparkConf().setMaster("local[*]").setAppName("KafkaReceiver")
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
val ssc = new StreamingContext(conf, Seconds(10))
import org.apache.spark.streaming.kafka.KafkaUtils
val kafkaStream = KafkaUtils.createStream(ssc, "<ipaddress>:2181","spark-streaming-consumer-group", Map("test1" -> 5))
Any suggestions for this issue?

Since you are using Scala 2.11 and Spark 2.1.0, you should be using these jars:
spark-streaming-kafka-0-10_2.11-2.1.0.jar
spark-streaming_2.11-2.1.0.jar
These are for Kafka 0.10+; for an older Kafka, change the artifacts accordingly.
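Rather than pointing --jars at the old 2.10/1.5.1 artifacts, one option is to let spark-shell resolve the matching package itself. The coordinates below are an illustrative sketch; align them with your exact Spark and Scala versions:
./spark-shell --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.1.0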
A simple program would then look like this:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.kafka.common.serialization.StringDeserializer
val streamingContext = new StreamingContext(sc, Seconds(5))
// Kafka parameters
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "servers", // comma-separated list of Kafka brokers, e.g. host1:9092,host2:9092
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "test-consumer-group",
"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = "topics,seperated,by,comma".split(",")
// crate dstreams
val stream = KafkaUtils.createDirectStream[String, String](
streamingContext,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
//stream.print()
stream.map(_.value().toString).print()
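Note that to actually start consuming you would still need to start the streaming context, e.g.:
streamingContext.start()
streamingContext.awaitTermination()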
Hope this helps!

Related

overloaded method value createDirectStream with alternatives

My Spark version is 1.6.2 and my Kafka version is 0.10.1.0. I want to send a custom object as the Kafka value type, push that object into a Kafka topic, and then read the data back with Spark Streaming using the direct approach. The following is my code:
import com.xxxxx.kafka.{KafkaJsonDeserializer, KafkaObjectDecoder, pharmacyData}
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object sparkReadKafka {
val sparkConf = new SparkConf().setAppName("SparkReadKafka")
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(1))
def main(args: Array[String]): Unit = {
val kafkaParams = Map[String, Object] (
"bootstrap.servers" -> "kafka.kafka-cluster-shared.non-prod-5-az-scus.prod.us.xxxxx.net:9092",
//"key.deserializer" -> classOf[StringDeserializer],
//"value.deserializer" -> classOf[KafkaJsonDeserializer],
"group.id" -> "consumer-group-2",
"auto.offset.reset" -> "earliest",
"auto.commit.interval.ms" -> "1000",
"enable.auto.commit" -> (false: java.lang.Boolean),
"session.timeout.ms" -> "30000"
)
val topic = "hw_insights"
val stream = KafkaUtils.createDirectStream[String, pharmacyData, StringDecoder, KafkaObjectDecoder](ssc, kafkaParams, Set(topic))
}
}
The error I got is similar to this (I had to remove some parts for security purposes):
Error:(29, 47) overloaded method value createDirectStream with alternatives:
(jssc: org.apache.spark.streaming.api.java.JavaStreamingContext,keyClass: Class[String],valueClass: Class[com.xxxxxxx.kafka.pharmacyData],keyDecoderClass: Class[kafka.serializer.StringDecoder],valueDecoderClass: Class[com.xxxxxxx.kafka.KafkaObjectDecoder],kafkaParams: java.util.Map[String,String],topics: java.util.Set[String])org.apache.spark.streaming.api.java.JavaPairInputDStream[String,com.xxxxxxx.kafka.pharmacyData]
(ssc: org.apache.spark.streaming.StreamingContext,kafkaParams: scala.collection.immutable.Map[String,String],topics: scala.collection.immutable.Set[String])(implicit evidence$19: scala.reflect.ClassTag[String], implicit evidence$20: scala.reflect.ClassTag[com.xxxxxxx.kafka.pharmacyData], implicit evidence$21: scala.reflect.ClassTag[kafka.serializer.StringDecoder], implicit evidence$22: scala.reflect.ClassTag[com.xxxxxxx.kafka.KafkaObjectDecoder])org.apache.spark.streaming.dstream.InputDStream[(String, com.xxxxxxx.kafka.pharmacyData)]
cannot be applied to (org.apache.spark.streaming.StreamingContext, scala.collection.immutable.Map[String,Object], scala.collection.immutable.Set[String])
val stream = KafkaUtils.createDirectStream[String, pharmacyData, StringDecoder, KafkaObjectDecoder](ssc, kafkaParams, Set(topic))
And below is my custom decoder class:
import kafka.serializer.Decoder
import org.codehaus.jackson.map.ObjectMapper
class KafkaObjectDecoder extends Decoder[pharmacyData] {
override def fromBytes(bytes: Array[Byte]): pharmacyData = {
val mapper = new ObjectMapper()
val pdata = mapper.readValue(bytes, classOf[pharmacyData])
pdata
}
}
Can someone please help me with this issue? Thank you!
The error is saying your parameters are incorrect:
cannot be applied to (org.apache.spark.streaming.StreamingContext, scala.collection.immutable.Map[String,Object], scala.collection.immutable.Set[String])
The closest method it thinks you want is:
(jssc: org.apache.spark.streaming.api.java.JavaStreamingContext,keyClass: Class[String],valueClass: Class[com.xxxxxxx.kafka.pharmacyData],keyDecoderClass: Class[kafka.serializer.StringDecoder],valueDecoderClass: Class[com.xxxxxxx.kafka.KafkaObjectDecoder],kafkaParams: java.util.Map[String,String],topics: java.util.Set[String])
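In other words, the Scala overload of the 0.8-style createDirectStream takes kafkaParams as Map[String, String], but a Map[String, Object] is being passed. A minimal sketch of the fix, assuming you stay on the 0.8-style API (note the string-only values; this older API uses metadata.broker.list and smallest/largest rather than earliest for the offset reset):
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> "kafka.kafka-cluster-shared.non-prod-5-az-scus.prod.us.xxxxx.net:9092",
  "group.id" -> "consumer-group-2",
  "auto.offset.reset" -> "smallest"
)
val stream = KafkaUtils.createDirectStream[String, pharmacyData, StringDecoder, KafkaObjectDecoder](
  ssc, kafkaParams, Set(topic))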

Reading from Kafka with Scala Spark2 Streaming

I need to connect to Kafka and read data from it (after that I have to write to an Elasticsearch database), but for now I just want to read and print the data.
I am a newbie with both Kafka and Scala, and from reading on the internet I have written this:
//spark
import org.apache.spark._
import org.apache.spark.streaming._
//kafka
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
object Main extends App{
val master = "local[2]"
val hostname = ""
val conf = new SparkConf().setAppName("KafkaConnection").setMaster(master)
val sc = SparkContext.getOrCreate(conf)
val ssc = new StreamingContext(sc, Seconds(1))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092,anotherhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "IRC",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("topicA", "topicB")
val stream = KafkaUtils.createDirectStream[String, String](
ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))
stream.map(record => (record.key, record.value))
val offsetRanges = Array(
// topic, partition, inclusive starting offset, exclusive ending offset
OffsetRange("test", 0, 0, 100),
OffsetRange("test", 1, 0, 100)
)
val rdd = KafkaUtils.createRDD[String, String](
ssc, kafkaParams, offsetRanges, PreferConsistent)
}
But I don't know how to continue. What do I need now? Also, do you know of any public Kafka broker/topic that I could read from?
Thank you in advance!
What do I need now?
Try running the code, either with spark-submit or by running the main method.
do you know any public Kafka Broker/topic which I can use to read from it?
That would be insecure, so no. Start your own broker locally by following the official Kafka quickstart guide.
Also note that your createRDD call currently reads from a topic called test.
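To actually see output from the DStream you also need an output operation, and you need to start the streaming context. A minimal sketch, reusing the stream variable from your code:
stream.map(record => (record.key, record.value)).print() // prints a few records per batch
ssc.start()
ssc.awaitTermination()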

error: value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.kafka.clients.consumer.ConsumerRecord[String,String]]

I am trying to capture Kafka events (which I am getting in serialised form) using Spark Streaming in Scala.
Here is my code-snippet:
val spark = SparkSession.builder().master("local[*]").appName("Spark-Kafka-Integration").getOrCreate()
spark.conf.set("spark.driver.allowMultipleContexts", "true")
val sc = spark.sparkContext
val ssc = new StreamingContext(sc, Seconds(5))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val topics=Set("<topic-name>")
val brokers="<some-list>"
val groupId="spark-streaming-test"
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> brokers,
"auto.offset.reset" -> "earliest",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
"group.id" -> groupId,
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val messages: InputDStream[ConsumerRecord[String, String]] =
KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
)
messages.foreachRDD { rdd =>
println(rdd.toDF())
}
ssc.start()
ssc.awaitTermination()
I am getting this error message:
Error:(59, 19) value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.kafka.clients.consumer.ConsumerRecord[String,String]] println(rdd.toDF())
toDF comes through DatasetHolder
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLImplicits
I haven't replicated it, but my guess is that there is no encoder for ConsumerRecord[String, String], so you can either provide one or first map the RDD to something for which an Encoder can be derived (a case class or a primitive).
Also, println within foreachRDD will probably not act the way you want due to the distributed nature of Spark.
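For example, a minimal sketch of the mapping approach, taking just the String value of each record (this assumes spark is the SparkSession created earlier, so its implicits are importable):
messages.foreachRDD { rdd =>
  import spark.implicits._
  val df = rdd.map(_.value()).toDF("value") // String has a built-in encoder
  df.show()
}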

Spark-shell Error object map is not a member of package org.apache.spark.streaming.rdd

I am trying to read JSON from a Kafka topic KafkaStreamTestTopic1 using Spark Streaming, parse out two values, valueStr1 and valueStr2, and convert them to a DataFrame for further processing.
I am running the code in spark-shell, so the Spark context sc is already available.
But when I run this script, it is giving me the following error:
error: object map is not a member of package org.apache.spark.streaming.rdd
val dfa = rdd.map(record => {
Below is the script used:
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka010._
import org.apache.kafka.common.serialization.StringDeserializer
import play.api.libs.json._
import org.apache.spark.sql._
val ssc = new StreamingContext(sc, Seconds(5))
val sparkSession = SparkSession.builder().appName("myApp").getOrCreate()
val sqlContext = new SQLContext(sc)
// Create direct kafka stream with brokers and topics
val topicsSet = Array("KafkaStreamTestTopic1").toSet
// Set kafka Parameters
val kafkaParams = Map[String, String](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
"value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
"group.id" -> "my_group",
"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> "false"
)
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams)
)
val lines = stream.map(_.value)
lines.print()
case class MyObj(val one: JsValue)
lines.foreachRDD(rdd => {
println("Debug Entered")
import sparkSession.implicits._
import sqlContext.implicits._
val dfa = rdd.map(record => {
implicit val myObjEncoder = org.apache.spark.sql.Encoders.kryo[MyObj]
val json: JsValue = Json.parse(record)
val value1 = (json \ "root" \ "child1" \ "child2" \ "valueStr1").getOrElse(null)
val value2 = (json \ "root" \ "child1" \ "child2" \ "valueStr2").getOrElse(null)
(new MyObj(value1), new MyObj(value2))
}).toDF()
dfa.show()
println("Dfa Size is: " + dfa.count())
})
ssc.start()
I suppose the problem is that rdd is also a package (org.apache.spark.streaming.rdd), which you imported automatically with the line:
import org.apache.spark.streaming._
To avoid this kind of clash, rename your variable to something else, for example myRdd:
lines.foreachRDD(myRdd => { /* ... */ })
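A minimal sketch of the renamed block, with the body simplified just to show the idea (your Json.parse logic and encoder would go back inside a map on myRdd, exactly as in the original code):
lines.foreachRDD(myRdd => {
  import sparkSession.implicits._
  val dfa = myRdd.toDF("rawJson") // myRdd is an RDD[String] of the raw messages
  dfa.show()
})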
Add the Spark Streaming dependencies to your build manager, for example:
"org.apache.spark" %% "spark-mllib" % SparkVersion,
"org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.1"
You can use Maven or SBT to add them during the build.
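For example, a minimal build.sbt sketch (the versions here are assumptions; align them with your Spark and Scala versions):
// build.sbt -- illustrative only
scalaVersion := "2.11.8"
val sparkVersion = "2.1.0"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion
)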

Wrong number of type parameters for overload function createDirectStream

I am new to Spark and Scala. While trying to run this simple code, which reads from a Kafka topic, I am bogged down by an error when creating the direct stream: it says I am providing the wrong number of type parameters for the overloaded function createDirectStream. Below is the line where I get the error:
val messages = KafkaUtils.createDirectStream [String, String, StringDecoder, StringDecoder]
(streamingContext, kafkaParams, topicsSet)
And below is the full code.
package com.test.spark
import java.util.Properties
import org.apache.spark
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka010._
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
object KafkaAirDRsProcess {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("AirDR Kafka to Spark")
val sc = new SparkContext(sparkConf)
val streamingContext = new StreamingContext(sc, Seconds(10))
// Create direct kafka stream with brokers and topics
val brokers = "10.21.165.145:6667 "
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val topics="AIRMAIN , dummy"
val topicsSet = topics.split(",").toSet
//val topicsSet=topics.map(_.toString).toSet
val messages = KafkaUtils.createDirectStream [String, String, StringDecoder, StringDecoder]
(streamingContext, kafkaParams, topicsSet)
val LinesDStream = messages.map(_._2)
val AirDRStream= LinesDStream.map(AirDRFilter.parseAirDR)
AirDRStream.foreachRDD(foreachFunc = rdd => {
System.out.println("--- New RDD with " + rdd.count() + " records");
if (rdd.count() > 0) {
rdd.toDF().registerTempTable("AirDRTemp")
val FilteredCDR = sqlContext.sql("select * from AirDRTemp" )
println("======================print result =================")
FilteredCDR.show()
}
});
//streamingContext.checkpoint("/tmp/mytest/ckpt/")
streamingContext.start()
streamingContext.awaitTermination()
}
}
Below is the snapshot of the IntelliJ error.
Since you are using kafka-0-10, you can create the InputDStream as shown below:
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "10.21.165.145:6667:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> (true: java.lang.Boolean)
)
val topics = ???
val stream = KafkaUtils.createDirectStream[String, String](
streamingContext,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
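With the 0.10 API, the value of each record is accessed with record.value() rather than the second element of a tuple, so the equivalent of your messages.map(_._2) would be something like:
val linesDStream = stream.map(_.value())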
Hope this helps!