Error trying to create direct stream between Spark and Kafka - scala

I am trying to follow this guide to enable my Spark shell to stream data from a Kafka topic: http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
In my Spark shell I run this code:
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "testid",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("my_topic")
topics.map(_.toString).toSet
val stream = KafkaUtils.createDirectStream[String, String](
  sc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)
stream.map(record => (record.key, record.value))
Everything seems to work up until the createDirectStream call. At that point I get this error:
scala> val stream = KafkaUtils.createDirectStream[String, String](
| sc,
| PreferConsistent,
| Subscribe[String, String](topics, kafkaParams)
| )
<console>:35: error: overloaded method value createDirectStream with alternatives:
(jssc: org.apache.spark.streaming.api.java.JavaStreamingContext,locationStrategy: org.apache.spark.streaming.kafka010.LocationStrategy,consumerStrategy: org.apache.spark.streaming.kafka010.ConsumerStrategy[String,String],perPartitionConfig: org.apache.spark.streaming.kafka010.PerPartitionConfig)org.apache.spark.streaming.api.java.JavaInputDStream[org.apache.kafka.clients.consumer.ConsumerRecord[String,String]] <and>
(jssc: org.apache.spark.streaming.api.java.JavaStreamingContext,locationStrategy: org.apache.spark.streaming.kafka010.LocationStrategy,consumerStrategy: org.apache.spark.streaming.kafka010.ConsumerStrategy[String,String])org.apache.spark.streaming.api.java.JavaInputDStream[org.apache.kafka.clients.consumer.ConsumerRecord[String,String]] <and>
(ssc: org.apache.spark.streaming.StreamingContext,locationStrategy: org.apache.spark.streaming.kafka010.LocationStrategy,consumerStrategy: org.apache.spark.streaming.kafka010.ConsumerStrategy[String,String],perPartitionConfig: org.apache.spark.streaming.kafka010.PerPartitionConfig)org.apache.spark.streaming.dstream.InputDStream[org.apache.kafka.clients.consumer.ConsumerRecord[String,String]] <and>
(ssc: org.apache.spark.streaming.StreamingContext,locationStrategy: org.apache.spark.streaming.kafka010.LocationStrategy,consumerStrategy: org.apache.spark.streaming.kafka010.ConsumerStrategy[String,String])org.apache.spark.streaming.dstream.InputDStream[org.apache.kafka.clients.consumer.ConsumerRecord[String,String]]
cannot be applied to (org.apache.spark.SparkContext, org.apache.spark.streaming.kafka010.LocationStrategy, org.apache.spark.streaming.kafka010.ConsumerStrategy[String,String])
val stream = KafkaUtils.createDirectStream[String, String](
^
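The overloads listed in the error all take a StreamingContext (or JavaStreamingContext) as the first argument, while the call above passes the shell's SparkContext sc. A minimal sketch of the fix, wrapping sc in a StreamingContext first (the 10-second batch interval here is an arbitrary placeholder):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Build a StreamingContext on top of the shell's SparkContext
val ssc = new StreamingContext(sc, Seconds(10))

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,  // StreamingContext, not SparkContext
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)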

Related

Problem integrating kafka and spark streaming no messages received in spark streaming

My Spark streaming context is successfully subscribed to my Kafka topic, where my tweets are streamed using my Twitter producer, but no messages are being streamed from the topic in my Spark streaming job!
Here is my code:
def main(args: Array[String]) {
  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "127.0.0.1:9093",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "use_a_separate_group_id_for_each_stream",
    "auto.offset.reset" -> "earliest", // earliest/latest
    "enable.auto.commit" -> (false: java.lang.Boolean)
  )
  val sparkConf = new SparkConf().setAppName("StreamTweets").setMaster("local[*]")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val topics = List("topic_three")
  val stream = KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](topics, kafkaParams))
  stream.map(record => (record.key, record.value))
  stream.print()
  stream.foreachRDD { rdd =>
    rdd.foreach { record =>
      val value = record.value()
      val tweet = scala.util.parsing.json.JSON.parseFull(value)
      val map: Map[String, Any] = tweet.get.asInstanceOf[Map[String, Any]]
      println(map.get("text"))
    }
  }
  ssc.start()
  ssc.awaitTermination()
}

scala: how to rectify "option" type after leftOuterJoin

Given
scala> val rdd1 = sc.parallelize(Seq(("a",1),("a",2),("b",3)))
scala> val rdd2 = sc.parallelize(Seq(("a",5),("c",6)))
scala> val rdd3 = rdd1.leftOuterJoin(rdd2)
scala> rdd3.collect()
res: Array[(String, (Int, Option[Int]))] = Array((a,(1,Some(5))), (a,(2,Some(5))), (b,(3,None)))
We can see the Option[Int] data type in rdd3. Is there a way to rectify this so that rdd3 is of type Array[(String, (Int, Int))]? Suppose we specify a default value (e.g. 999) for the None case.
scala> val result = rdd3.collect()
scala> result.map(t => (t._1, (t._2._1, t._2._2.getOrElse(999))))
This should do it.
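If you would rather fill in the default while the data is still an RDD, before collect, the same getOrElse trick works with mapValues (a sketch on the rdd3 from above, still using 999 as the placeholder default):

// Substitute 999 for None on the distributed data
val rdd4 = rdd3.mapValues { case (x, opt) => (x, opt.getOrElse(999)) }
rdd4.collect()  // Array[(String, (Int, Int))]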

Using KeyValueGroupedDataset cogroup in spark

I would like to use the cogroup method on KeyValueGroupedDataset in Spark. Here is a Scala attempt, but I am getting an error:
import org.apache.spark.sql.functions._
val x1 = Seq(("a", 36), ("b", 33), ("c", 40), ("a", 38), ("c", 39)).toDS
val g1 = x1.groupByKey(_._1)
val x2 = Seq(("a", "ali"), ("b", "bob"), ("c", "celine"), ("a", "amin"), ("c", "cecile")).toDS
val g2 = x2.groupByKey(_._1)
val cog = g1.cogroup(g2, (k: Long, iter1:Iterator[(String, Int)], iter2:Iterator[(String, String)]) => iter1);
This is the error:
<console>:34: error: overloaded method value cogroup with alternatives:
[U, R](other: org.apache.spark.sql.KeyValueGroupedDataset[String,U], f: org.apache.spark.api.java.function.CoGroupFunction[String,(String, Int),U,R], encoder: org.apache.spark.sql.Encoder[R])org.apache.spark.sql.Dataset[R] <and>
[U, R](other: org.apache.spark.sql.KeyValueGroupedDataset[String,U])(f: (String, Iterator[(String, Int)], Iterator[U]) => TraversableOnce[R])(implicit evidence$11: org.apache.spark.sql.Encoder[R])org.apache.spark.sql.Dataset[R]
cannot be applied to (org.apache.spark.sql.KeyValueGroupedDataset[String,(String, String)], (Long, Iterator[(String, Int)], Iterator[(String, String)]) => Iterator[(String, Int)])
val cog = g1.cogroup(g2, (k: Long, iter1:Iterator[(String, Int)], iter2:Iterator[(String, String)]) => iter1);
I am getting the same error in Java.
The cogroup you are trying to use is curried, so you have to pass the Dataset and the function in separate argument lists. There is also a type mismatch in the key type (the key is a String, not a Long):
g1.cogroup(g2)(
  (k: String, it1: Iterator[(String, Int)], it2: Iterator[(String, String)]) =>
    it1)
or just:
g1.cogroup(g2)((_, it1, _) => it1)
In Java, I'd use the CoGroupFunction variant:
import org.apache.spark.api.java.function.CoGroupFunction;
import org.apache.spark.sql.Encoders;
import scala.Tuple2;

g1.cogroup(
  g2,
  (CoGroupFunction<String, Tuple2<String, Integer>, Tuple2<String, String>, Tuple2<String, Integer>>) (key, it1, it2) -> it1,
  Encoders.tuple(Encoders.STRING(), Encoders.INT()));
where g1 and g2 are KeyValueGroupedDataset<String, Tuple2<String, Integer>> and KeyValueGroupedDataset<String, Tuple2<String, String>>, respectively.
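For a cogroup that actually combines the two sides instead of discarding iter2, a sketch like the following should work in the same session (merged, ages and names are my own names; spark.implicits._ is assumed to be in scope, as it is in the shell):

// One row per key: all ages from g1 and all names from g2
val merged = g1.cogroup(g2) { (key, ages, names) =>
  Iterator((key, ages.map(_._2).toSeq, names.map(_._2).toSeq))
}
// merged: Dataset[(String, Seq[Int], Seq[String])]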

Why does Spark/Scala compiler fail to find toDF on RDD[Map[Int, Int]]?

Why does the following end up with an error?
scala> import sqlContext.implicits._
import sqlContext.implicits._
scala> val rdd = sc.parallelize(1 to 10).map(x => (Map(x -> 0), 0))
rdd: org.apache.spark.rdd.RDD[(scala.collection.immutable.Map[Int,Int], Int)] = MapPartitionsRDD[20] at map at <console>:27
scala> rdd.toDF
res8: org.apache.spark.sql.DataFrame = [_1: map<int,int>, _2: int]
scala> val rdd = sc.parallelize(1 to 10).map(x => Map(x -> 0))
rdd: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[Int,Int]] = MapPartitionsRDD[23] at map at <console>:27
scala> rdd.toDF
<console>:30: error: value toDF is not a member of org.apache.spark.rdd.RDD[scala.collection.immutable.Map[Int,Int]]
rdd.toDF
So what exactly is happening here? toDF can convert an RDD of type (scala.collection.immutable.Map[Int,Int], Int) to a DataFrame, but not one of type scala.collection.immutable.Map[Int,Int]. Why is that?
For the same reason you cannot use
sqlContext.createDataFrame((1 to 10).map(x => Map(x -> 0)))
If you take a look at the org.apache.spark.sql.SQLContext source you'll find two different implementations of the createDataFrame method:
def createDataFrame[A <: Product : TypeTag](rdd: RDD[A]): DataFrame
and
def createDataFrame[A <: Product : TypeTag](data: Seq[A]): DataFrame
As you can see, both require A to be a subclass of Product. Calling toDF on an RDD[(Map[Int,Int], Int)] works because Tuple2 is a Product. Map[Int,Int] by itself is not, hence the error.
You can make it work by wrapping Map with Tuple1:
sc.parallelize(1 to 10).map(x => Tuple1(Map(x -> 0))).toDF
Basically, it is because there is no implicit to create a DataFrame for a Map inside an RDD.
In your first example you are returning a Tuple, which is a Product, for which there is an implicit conversion:
rddToDataFrameHolder[A <: Product : TypeTag](rdd: RDD[A])
In the second example you have a Map in your RDD, for which there is no such implicit conversion.
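If you want a readable column name instead of _1, another option is to wrap the Map in a small case class (MapHolder is a hypothetical name; a case class is a Product, so the same implicit applies):

case class MapHolder(m: Map[Int, Int])
sc.parallelize(1 to 10).map(x => MapHolder(Map(x -> 0))).toDF
// expected schema: [m: map<int,int>]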

Spark pairRDD not working

value subtractByKey is not a member of org.apache.spark.rdd.RDD[(String, LabeledPoint)]
value join is not a member of org.apache.spark.rdd.RDD[(String, LabeledPoint)]
Why is this happening? org.apache.spark.rdd.RDD[(String, LabeledPoint)] is a pair RDD, and I have already imported org.apache.spark.rdd._
In the spark-shell, this works exactly as expected, without having to import anything:
scala> case class LabeledPoint(x: Int, y: Int, label: String)
defined class LabeledPoint
scala> val rdd1 = sc.parallelize(List("this","is","a","test")).map(label => (label, LabeledPoint(0,0,label)))
rdd1: org.apache.spark.rdd.RDD[(String, LabeledPoint)] = MapPartitionsRDD[1] at map at <console>:23
scala> val rdd2 = sc.parallelize(List("this","is","a","test")).map(label => (label, 1))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:21
scala> rdd1.join(rdd2)
res0: org.apache.spark.rdd.RDD[(String, (LabeledPoint, Int))] = MapPartitionsRDD[6] at join at <console>:28
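subtractByKey comes from the same implicit conversion to PairRDDFunctions that provides join, so it should resolve in that session too; continuing the sketch above:

// Keys present in rdd2 are removed from rdd1
rdd1.subtractByKey(rdd2)  // expected type: RDD[(String, LabeledPoint)]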