Problem integrating Kafka and Spark Streaming: no messages received in Spark Streaming - scala

My Spark streaming context successfully subscribes to the Kafka topic where my tweets are streamed by my Twitter producer, but no messages are being streamed from the topic into my Spark streaming job!
Here is my code:
def main(args: Array[String]) {
  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "127.0.0.1:9093",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "use_a_separate_group_id_for_each_stream",
    "auto.offset.reset" -> "earliest", // earliest/latest
    "enable.auto.commit" -> (false: java.lang.Boolean)
  )

  val sparkConf = new SparkConf().setAppName("StreamTweets").setMaster("local[*]")
  val ssc = new StreamingContext(sparkConf, Seconds(1))

  val topics = List("topic_three")
  val stream = KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](topics, kafkaParams))

  stream.map(record => (record.key, record.value))
  stream.print()

  stream.foreachRDD { rdd =>
    rdd.foreach { record =>
      val value = record.value()
      val tweet = scala.util.parsing.json.JSON.parseFull(value)
      val map: Map[String, Any] = tweet.get.asInstanceOf[Map[String, Any]]
      println(map.get("text"))
    }
  }

  ssc.start()
  ssc.awaitTermination()
}
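As a sanity check (not an answer by itself), it can help to read the topic with a plain Kafka consumer outside Spark: if nothing shows up here either, the problem is on the producer/broker side rather than in the streaming job. A minimal sketch, assuming kafka-clients 2.0+ and the same broker address as above:

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "127.0.0.1:9093")
props.put("group.id", "debug_consumer") // hypothetical group id, used only for this check
props.put("auto.offset.reset", "earliest")
props.put("key.deserializer", classOf[StringDeserializer].getName)
props.put("value.deserializer", classOf[StringDeserializer].getName)

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("topic_three"))

// Poll a few times; any tweets already in the topic should be printed.
for (_ <- 1 to 5) {
  val records = consumer.poll(Duration.ofSeconds(2))
  records.asScala.foreach(r => println(s"${r.key} -> ${r.value}"))
}
consumer.close()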

Related

scala - map & flatten shows different result than flatMap

val adjList = Map("Logging" -> List("Networking", "Game"))
// val adjList: Map[String, List[String]] = Map(Logging -> List(Networking, Game))
adjList.flatMap { case (v, vs) => vs.map(n => (v, n)) }.toList
// val res7: List[(String, String)] = List((Logging,Game))
adjList.map { case (v, vs) => vs.map(n => (v, n)) }.flatten.toList
// val res8: List[(String, String)] = List((Logging,Networking), (Logging,Game))
I am not sure what is happening here. I was expecting the same result from both of them.
.flatMap here is Map's .flatMap (its function returns key-value pairs, so the result is built as a Map), but .map is effectively Iterable's .map (its function returns a List rather than a pair, so the result cannot be a Map).
For a Map, "Logging" -> "Networking" and "Logging" -> "Game" collapse into just the latter, "Logging" -> "Game", because the keys are the same.
val adjList: Map[String, List[String]] = Map("Logging" -> List("Networking", "Game"))
val x0: Map[String, String] = adjList.flatMap { case (v, vs) => vs.map(n => (v, n)) }
//Map(Logging -> Game)
val x: List[(String, String)] = x0.toList
//List((Logging,Game))
val adjList: Map[String, List[String]] = Map("Logging" -> List("Networking", "Game"))
val y0: immutable.Iterable[List[(String, String)]] = adjList.map { case (v, vs) => vs.map(n => (v, n)) }
//List(List((Logging,Networking), (Logging,Game)))
val y1: immutable.Iterable[(String, String)] = y0.flatten
//List((Logging,Networking), (Logging,Game))
val y: List[(String, String)] = y1.toList
//List((Logging,Networking), (Logging,Game))
Also https://users.scala-lang.org/t/map-flatten-flatmap/4180
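If the goal is the full list of pairs, a simple way to avoid the key collapsing (a small sketch using the same data as above) is to leave the Map before flattening:

adjList.toList.flatMap { case (v, vs) => vs.map(n => (v, n)) }
// List((Logging,Networking), (Logging,Game))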

Spark: Intersection between Key-Value pair and Key RDD

I have two RDDs; rdd1 = RDD[(String, Array[String])] and rdd2 = RDD[String].
I want to remove all entries of rdd1 whose key is not found in rdd2.
Thank you in advance!
You can use an inner join, but first you have to turn the second RDD into a pair RDD:
val rdd1: RDD[(String, Array[String])] = ???
val rdd2: RDD[String] = ???
val asPairRdd: RDD[(String, Unit)] = rdd2.map(s => (s, ()))
val res: RDD[(String, Array[String])] = rdd1.join(asPairRdd).map {
  case (k, (v, dummy)) => (k, v)
}
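For completeness, a self-contained sketch of the same join-as-filter idea with made-up sample data and a local master (the object name and data here are placeholders, not from the question):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object JoinFilterExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("join-filter").setMaster("local[*]"))

    val rdd1: RDD[(String, Array[String])] =
      sc.parallelize(Seq("a" -> Array("x", "y"), "b" -> Array("z"), "c" -> Array("w")))
    val rdd2: RDD[String] = sc.parallelize(Seq("a", "c"))

    // The inner join keeps only the keys of rdd1 that also appear in rdd2.
    val res: RDD[(String, Array[String])] =
      rdd1.join(rdd2.map(s => (s, ()))).map { case (k, (v, _)) => (k, v) }

    res.collect().foreach { case (k, v) => println(s"$k -> ${v.mkString(",")}") }
    // prints a -> x,y and c -> w (order may vary)

    sc.stop()
  }
}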

Using KeyValueGroupedDataset cogroup in spark

I would like to use the cogroup method on a KeyValueGroupedDataset in Spark. Here is a Scala attempt:
import org.apache.spark.sql.functions._
val x1 = Seq(("a", 36), ("b", 33), ("c", 40), ("a", 38), ("c", 39)).toDS
val g1 = x1.groupByKey(_._1)
val x2 = Seq(("a", "ali"), ("b", "bob"), ("c", "celine"), ("a", "amin"), ("c", "cecile")).toDS
val g2 = x2.groupByKey(_._1)
val cog = g1.cogroup(g2, (k: Long, iter1:Iterator[(String, Int)], iter2:Iterator[(String, String)]) => iter1);
But I am getting this error:
<console>:34: error: overloaded method value cogroup with alternatives:
[U, R](other: org.apache.spark.sql.KeyValueGroupedDataset[String,U], f: org.apache.spark.api.java.function.CoGroupFunction[String,(String, Int),U,R], encoder: org.apache.spark.sql.Encoder[R])org.apache.spark.sql.Dataset[R] <and>
[U, R](other: org.apache.spark.sql.KeyValueGroupedDataset[String,U])(f: (String, Iterator[(String, Int)], Iterator[U]) => TraversableOnce[R])(implicit evidence$11: org.apache.spark.sql.Encoder[R])org.apache.spark.sql.Dataset[R]
cannot be applied to (org.apache.spark.sql.KeyValueGroupedDataset[String,(String, String)], (Long, Iterator[(String, Int)], Iterator[(String, String)]) => Iterator[(String, Int)])
val cog = g1.cogroup(g2, (k: Long, iter1:Iterator[(String, Int)], iter2:Iterator[(String, String)]) => iter1);
I am getting the same error in Java.
The cogroup you are trying to use is curried, so the dataset and the function go in separate argument lists. There is also a type mismatch in the key type (it is String, not Long):
g1.cogroup(g2)(
  (k: String, it1: Iterator[(String, Int)], it2: Iterator[(String, String)]) =>
    it1)
or just:
g1.cogroup(g2)((_, it1, _) => it1)
In Java, I'd use the CoGroupFunction variant:
import org.apache.spark.api.java.function.CoGroupFunction;
import org.apache.spark.sql.Encoders;
g1.cogroup(
    g2,
    (CoGroupFunction<String, Tuple2<String, Integer>, Tuple2<String, String>, Tuple2<String, Integer>>) (key, it1, it2) -> it1,
    Encoders.tuple(Encoders.STRING(), Encoders.INT()));
where g1 and g2 are KeyValueGroupedDataset<String, Tuple2<String, Integer>> and KeyValueGroupedDataset<String, Tuple2<String, String>> respectively.
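For reference, a self-contained Scala sketch of the curried call using the question's sample data (the local SparkSession and the pass-through cogroup body are just for illustration):

import org.apache.spark.sql.SparkSession

object CogroupExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cogroup-example").master("local[*]").getOrCreate()
    import spark.implicits._

    val g1 = Seq(("a", 36), ("b", 33), ("c", 40), ("a", 38), ("c", 39)).toDS().groupByKey(_._1)
    val g2 = Seq(("a", "ali"), ("b", "bob"), ("c", "celine"), ("a", "amin"), ("c", "cecile")).toDS().groupByKey(_._1)

    // The key type is String, and the function goes in its own argument list.
    val cog = g1.cogroup(g2) { (_, it1, _) => it1 }
    cog.show()

    spark.stop()
  }
}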

Error trying to create direct stream between Spark and Kafka

I am trying to follow this guide to enable my Spark shell to stream data from a Kafka topic: http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
In my Spark shell I run this code:
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "testid",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val topics = Array("my_topic")
topics.map(_.toString).toSet

val stream = KafkaUtils.createDirectStream[String, String](
  sc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)
stream.map(record => (record.key, record.value))
It seems to work up until the createDirectStream call, where I get this error:
scala> val stream = KafkaUtils.createDirectStream[String, String](
| sc,
| PreferConsistent,
| Subscribe[String, String](topics, kafkaParams)
| )
<console>:35: error: overloaded method value createDirectStream with alternatives:
(jssc: org.apache.spark.streaming.api.java.JavaStreamingContext,locationStrategy: org.apache.spark.streaming.kafka010.LocationStrategy,consumerStrategy: org.apache.spark.streaming.kafka010.ConsumerStrategy[String,String],perPartitionConfig: org.apache.spark.streaming.kafka010.PerPartitionConfig)org.apache.spark.streaming.api.java.JavaInputDStream[org.apache.kafka.clients.consumer.ConsumerRecord[String,String]] <and>
(jssc: org.apache.spark.streaming.api.java.JavaStreamingContext,locationStrategy: org.apache.spark.streaming.kafka010.LocationStrategy,consumerStrategy: org.apache.spark.streaming.kafka010.ConsumerStrategy[String,String])org.apache.spark.streaming.api.java.JavaInputDStream[org.apache.kafka.clients.consumer.ConsumerRecord[String,String]] <and>
(ssc: org.apache.spark.streaming.StreamingContext,locationStrategy: org.apache.spark.streaming.kafka010.LocationStrategy,consumerStrategy: org.apache.spark.streaming.kafka010.ConsumerStrategy[String,String],perPartitionConfig: org.apache.spark.streaming.kafka010.PerPartitionConfig)org.apache.spark.streaming.dstream.InputDStream[org.apache.kafka.clients.consumer.ConsumerRecord[String,String]] <and>
(ssc: org.apache.spark.streaming.StreamingContext,locationStrategy: org.apache.spark.streaming.kafka010.LocationStrategy,consumerStrategy: org.apache.spark.streaming.kafka010.ConsumerStrategy[String,String])org.apache.spark.streaming.dstream.InputDStream[org.apache.kafka.clients.consumer.ConsumerRecord[String,String]]
cannot be applied to (org.apache.spark.SparkContext, org.apache.spark.streaming.kafka010.LocationStrategy, org.apache.spark.streaming.kafka010.ConsumerStrategy[String,String])
val stream = KafkaUtils.createDirectStream[String, String](
^
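None of the listed overloads accepts a SparkContext; createDirectStream expects a StreamingContext (or a JavaStreamingContext). A minimal sketch of the fix, assuming a 10-second batch interval is acceptable:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Wrap the shell's SparkContext in a StreamingContext and pass that instead of sc.
val ssc = new StreamingContext(sc, Seconds(10))

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)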

Passing sqlContext when testing a method

I have the following test case:
test("check foo") {
val conf = new SparkConf()
val sc = new SparkContext(conf)
val sqlc = new SQLContext(sc)
val res = foo("A", "B")
assert(true)
}
Which checks the following method:
def foo(arg1: String, arg2: String)(implicit sqlContext: SQLContext): Seq[String] = {
  // some other code
}
When running the tests I get the following error:
Error:(65, 42) could not find implicit value for parameter sqlContext: org.apache.spark.sql.SQLContext
val res = foo("A", "B")
How can I share the SQLContext instance I create in the test method with foo?
Put implicit in front of val sqlc:
implicit val sqlc = new SQLContext(sc)
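Put together, a sketch of the fixed test (the local master and app name are added here only so the test runs standalone; the assertion is the placeholder from the question):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

test("check foo") {
  val conf = new SparkConf().setMaster("local[*]").setAppName("foo-test")
  val sc = new SparkContext(conf)
  // Because sqlc is implicit, the compiler supplies it to foo's implicit parameter.
  implicit val sqlc: SQLContext = new SQLContext(sc)

  val res = foo("A", "B")
  assert(true)

  sc.stop()
}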