I want to consume messages from Kafka topic using Scala 2.10.6 and Spark 1.6.2. For Kafka I am using this dependency:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.6.2</version>
</dependency>
This code compiles fine, however I want to define auto.offset.reset and here the problem arises:
val topicMap = topic.split(",").map((_, kafkaNumThreads.toInt)).toMap
val data = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap,
StorageLevel.MEMORY_AND_DISK_SER_2).map(_._2)
When I add kafkaParams, it does not compile anymore:
val kafkaParams = Map[String, String](
"zookeeper.connect" -> zkQuorum, "group.id" -> group,
"zookeeper.connection.timeout.ms" -> "10000",
"auto.offset.reset" -> "smallest")
val data = KafkaUtils.createStream(ssc, kafkaParams, topicMap,
StorageLevel.MEMORY_AND_DISK_SER_2).map(_._2)
Error message:
94: error: missing parameter type for expanded function ((x$3) => x$3._2)
[ERROR] StorageLevel.MEMORY_AND_DISK_SER_2).map(_._2)
I tried many different cominations of parameters of createStream, but everything fails. Can someone help please?
You need to add type parameters to KafkaUtils.createStream for it to resolve the underlying types of the stream. For example, if your key and value are of type String:
val data: DStream[String] =
KafkaUtils
.createStream[String, String, StringDecoder, StringDecoder](
ssc,
kafkaParams,
topicMap,
StorageLevel.MEMORY_AND_DISK_SER_2
).map(_._2)
Related
I am trying to capture Kafka events (which I am getting in serialised form) using sparkStreaming in Scala.
Here is my code-snippet:
val spark = SparkSession.builder().master("local[*]").appName("Spark-Kafka-Integration").getOrCreate()
spark.conf.set("spark.driver.allowMultipleContexts", "true")
val sc = spark.sparkContext
val ssc = new StreamingContext(sc, Seconds(5))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val topics=Set("<topic-name>")
val brokers="<some-list>"
val groupId="spark-streaming-test"
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> brokers,
"auto.offset.reset" -> "earliest",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
"group.id" -> groupId,
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val messages: InputDStream[ConsumerRecord[String, String]] =
KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
)
messages.foreachRDD { rdd =>
println(rdd.toDF())
}
ssc.start()
ssc.awaitTermination()
I am getting error message as:
Error:(59, 19) value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.kafka.clients.consumer.ConsumerRecord[String,String]] println(rdd.toDF())
toDF comes through DatasetHolder
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLImplicits
I haven't replicated it but my guess is that there's no encoder for ConsumerRecord[String, String] so you can either provide one or map it first to something for which an Encoder can be derived (case class or a primitive)
also println within foreachRDD will probably not act the way you want due to the distributed nature of spark
I am getting error CDRS.toDF() error
case class CDR(phone:String, first_type:String,in_out:String,local:String,duration:String,date:String,time:String,roaming:String,amount:String,in_network:String,is_promo:String,toll_free:String,bytes:String,last_type:String)
// Create direct Kafka stream with brokers and topics
//val topicsSet = Set[String] (kafka_topic)
val topicsSet = Set[String] (kafka_topic)
val kafkaParams = Map[String, String]("metadata.broker.list" ->
kafka_broker)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder,
StringDecoder](
ssc, kafkaParams, topicsSet).map(_._2)
//===============================================================================================
//Apply Schema Of Class CDR to Message Coming From Kafka
val CDRS = messages.map(_.split('|')).map(x=> CDR
(x(0),x(1),x(2),x(3),x(4),x(5),x(6),x(7),x(8),x(9),x(10),x(11),x(12),x(13).repla ceAll("\n","")))
I am playing with Spark Streaming and Kafka (with the Scala API), and would like to read message from a set of Kafka topics with Spark Streaming.
The following method:
val kafkaParams = Map("metadata.broker.list" -> configuration.getKafkaBrokersList(), "auto.offset.reset" -> "smallest")
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
reads from Kafka to the latest available offset, but doesn't give me the metadata that I need (since I am reading from a set of topics, I need for every message I read that topic) but this other method KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, Tuple2[String, String]](ssc, kafkaParams, currentOffsets, messageHandler) wants explicitly an offset that I don't have.
I know that there is this shell command that gives you the last offset.
kafka-run-class.sh kafka.tools.GetOffsetShell
--broker-list <broker>: <port>
--topic <topic-name> --time -1 --offsets 1
and KafkaCluster.scala is an API that is for developers that used to be public and gives you exactly what I would like.
Hint?
You can use the code from GetOffsetShell.scala kafka API documentation
val consumer = new SimpleConsumer(leader.host, leader.port, 10000, 100000, clientId)
val topicAndPartition = TopicAndPartition(topic, partitionId)
val request = OffsetRequest(Map(topicAndPartition -> PartitionOffsetRequestInfo(time, nOffsets)))
val offsets = consumer.getOffsetsBefore(request).partitionErrorAndOffsets(topicAndPartition).offsets
Or you can create new consumer with unique groupId and use it for getting first offset
val consumer=new KafkaConsumer[String, String](createConsumerConfig(config.brokerList))
consumer.partitionsFor(config.topic).foreach(pi => {
val topicPartition = new TopicPartition(pi.topic(), pi.partition())
consumer.assign(List(topicPartition))
consumer.seekToBeginning()
val firstOffset = consumer.position(topicPartition)
...
I have set up the Spark-Kafka Consumer in Scala that receives messages from multiple topics:
val properties = readProperties()
val streamConf = new SparkConf().setMaster("local[*]").setAppName("Kafka-Stream")
val ssc = new StreamingContext(streamConf, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> properties.getProperty("broker_connection_str"),
"zookeeper.connect" -> properties.getProperty("zookeeper_connection_str"),
"group.id" -> properties.getProperty("group_id"),
"auto.offset.reset" -> properties.getProperty("offset_reset")
)
// Kafka integration with receiver
val msgStream = KafkaUtils.createStream[Array[Byte], String, DefaultDecoder, StringDecoder](
ssc, kafkaParams, Map(properties.getProperty("topic1") -> 1,
properties.getProperty("topic2") -> 2,
properties.getProperty("topic3") -> 3),
StorageLevel.MEMORY_ONLY_SER).map(_._2)
I need to develop corresponding action code for messages (which will be in JSON format) from each topic.
I referred to the following question, but the answer in it didn't help me:
get topic from Kafka message in spark
So, is there any method on the received DStream that can be used to fetch topic name along with the message to determine what action should take place?
Any help on this would be greatly appreciated. Thank you.
See the code below.
You can get topic name and message by foreachRDD, map operation on DStream.
msgStream.foreachRDD(rdd => {
val pairRdd = rdd.map(i => (i.topic(), i.value()))
})
The code below is an example source of createDirectStream that I am using.
val ssc = new StreamingContext(configLoader.sparkConfig, Seconds(conf.getInt(Conf.KAFKA_PULL_INTERVAL)))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> conf.getString(Conf.KAFKA_BOOTSTRAP_SERVERS),
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> conf.getString(Conf.KAFKA_CONSUMER_GID),
"auto.offset.reset" -> conf.getString(Conf.KAFKA_AUTO_OFFSET_RESET),
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics: Array[String] = conf.getString(Conf.KAFKA_TOPICS).split(",")
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
If I feed my Spark-Streaming Application more than one topic like so:
val ssc = new StreamingContext(sc, Seconds(2))
val topics = Set("raw_1", "raw_2)
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
When I run my application how can I figure out from my stream what the difference between which topic it is pulling from? Is there a way to do? If I do something like
val lines = stream.print()
I am getting nothing of differentiation. Is the only way to do it to make the Kafka Message Key a denoting factor?
Yes, you can use MessageAndMetadata version of createDirectStream which allows you to access message metadata.
You can find example implementation here .