KafkaUtils API | offset management | Spark Streaming - scala

I am trying to manage Kafka offsets for exactly-once semantics.
I'm facing a problem while creating a direct stream using an offset map, as follows:
val fromOffsets : (TopicAndPartition, Long) = TopicAndPartition(metrics_rs.getString(1), metrics_rs.getInt(2)) -> metrics_rs.getLong(3)
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,(String, String)] (ssc,kafkaParams,fromOffsets,messageHandler)
here,
val messageHandler =
(mmd: MessageAndMetadata[String, String]) => mmd.message.length
And
metrics_rs = metricsStatement.executeQuery("SELECT part,off from metrics.txn_offsets where topic='" + t + "'")
I guess I am doing something wrong with the declaration style; I'd appreciate any help.
The compilation error says "too many type arguments for createDirectStream".

There are a couple of things you're doing wrong.
You need to pass a Map[TopicAndPartition, Long], while currently you have a Tuple2[TopicAndPartition, Long]. So you need:
val fromOffsets: Map[TopicAndPartition, Long] =
  Map(TopicAndPartition(metrics_rs.getString(1), metrics_rs.getInt(2)) -> metrics_rs.getLong(3))
You say your return type from createDirectStream is a tuple of type (String, String), yet your messageHandler returns an Int. If you want to return a tuple of key-value pairs, you need:
val messageHandler: MessageAndMetadata[String, String] => (String, String) =
(mmd: MessageAndMetadata[String, String]) => (mmd.key(), mmd.message())
After fixing that, this should compile:
val stream = KafkaUtils
  .createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
    ssc,
    kafkaParams,
    fromOffsets,
    messageHandler)
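If your offsets table can hold more than one partition per topic, you can also build the whole map from the result set instead of a single entry. This is a minimal sketch, assuming metrics_rs also returns the topic column so that columns 1-3 are topic, partition and offset, as in your snippet:
import kafka.common.TopicAndPartition

// Sketch only: iterate the JDBC ResultSet and collect one map entry per row.
// The column positions (1 = topic, 2 = partition, 3 = offset) are an assumption
// taken from the snippet in the question.
val offsetsBuilder = Map.newBuilder[TopicAndPartition, Long]
while (metrics_rs.next()) {
  offsetsBuilder += TopicAndPartition(metrics_rs.getString(1), metrics_rs.getInt(2)) ->
    metrics_rs.getLong(3)
}
val fromOffsets: Map[TopicAndPartition, Long] = offsetsBuilder.result()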

Related

error: value split is not a member of (String, String)

I have two RDDs, rdd1 and rdd2, of type RDD[String].
I have performed a cartesian product on the two RDDs in Scala Spark:
val cartesianproduct = rdd1.cartesian(rdd2)
When I run the code below, I get an error.
val splitup = cartesianproduct.map(line => line.split(","))
Below is the error I am getting:
error: value split is not a member of (String, String)
A cartesian join returns an RDD of Tuple2[String, String], so you have to perform the map operation on a Tuple2[String, String], not on a String. Here is an example of how to handle the tuple in the map function:
val cartesianproduct = rdd1.cartesian(rdd2)
val splitup = cartesianproduct.map{ case (line1, line2) => line1.split(",") ++ line2.split(",")}
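If you would rather keep the two sides of each pair separate instead of concatenating them, the same pattern works; this is just an illustrative variant (splitPairs is a made-up name):
// Each element stays a pair: the fields of line1 and the fields of line2.
val splitPairs: org.apache.spark.rdd.RDD[(Array[String], Array[String])] =
  cartesianproduct.map { case (line1, line2) => (line1.split(","), line2.split(",")) }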

Create Json object from Circe where the value can be String or a List

I would like to create a JSON object with circe where the value can be a String or a List, like:
val param = Map[String, Map[String, Object]](
  "param_a" -> Map[String, Object](
    "param_a1" -> "str_value",
    "param_a2" -> List(
      Map[String, String](
        "param_a2.1" -> "value_2.1",
        "param_a2.2" -> "value_2.2")
    )
  ),
However, when I then do
param.asJson
it fails with:
Error:(61, 23) could not find implicit value for parameter encoder: io.circe.Encoder[scala.collection.immutable.Map[String,Map[String,Object]]]
.postData(param.asJson.toString().getBytes)
OK, a quick fix is to use Map[String, Json]:
val param = Map[String, Map[String, Json]](
  "param_a" -> Map[String, Json](
    "param_a1" -> "str_value".asJson,
    "param_a2" -> List(
      Map[String, String](
        "param_a2.1" -> "value_2.1",
        "param_a2.2" -> "value_2.2")
    ).asJson
  ),
You just need to provide an implicit Encoder instance for Object in scope. Try this:
implicit val objEncoder: Encoder[Object] = Encoder.instance {
  case x: String => x.asJson
  case xs: List[Map[String, String]] => xs.asJson
}
However, I would avoid using Object and instead provide an ADT to wrap the two possible cases, that is, String and List[Map[String, String]], but that's up to you. Furthermore, in the Scala world Object is more widely known as AnyRef, so if you do want to use Object I suggest you call it AnyRef.
P.S.: If you're using a Scala version >= 2.12.0 you can avoid typing Encoder.instance thanks to SAM conversion in overloading resolution. So the code would become:
implicit val objEncoder: Encoder[Object] = {
  case x: String => x.asJson
  case xs: List[Map[String, String]] => xs.asJson
}
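For reference, a minimal sketch of the ADT approach suggested above could look like this; the Param, StringParam and ListParam names are made up for illustration:
import io.circe.Encoder
import io.circe.syntax._

// A small ADT wrapping the two value shapes the map is allowed to hold.
sealed trait Param
final case class StringParam(value: String) extends Param
final case class ListParam(value: List[Map[String, String]]) extends Param

implicit val paramEncoder: Encoder[Param] = Encoder.instance {
  case StringParam(s) => s.asJson
  case ListParam(xs)  => xs.asJson
}

val param: Map[String, Map[String, Param]] = Map(
  "param_a" -> Map(
    "param_a1" -> StringParam("str_value"),
    "param_a2" -> ListParam(List(Map(
      "param_a2.1" -> "value_2.1",
      "param_a2.2" -> "value_2.2")))
  )
)

param.asJson // compiles because an Encoder[Param] is now in scope
Because Param is sealed, the compiler can also warn you if a new case is added without a matching branch in the encoder.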

Lift-Json Extracting from JSON object

I have this code below:
object Test {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Spark").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(3))

    val kafkaBrokers = Map("metadata.broker.list" -> "HostName:9092")
    val offsetMap = Map(TopicAndPartition("topic_test", 0), 8)
    val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaBrokers, offsetMap)

    var offsetArray = Array[OffsetRange]()
    lines.transform { rdd =>
      offsetArray = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }.map {
      _.message()
    }.foreachRDD { rdd =>
      /* NEW CODE */
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
I have added the new code under the comment /* NEW CODE */. My understanding is that the lines val will contain a sequence of RDDs coming from the Kafka server every 3 seconds, and I then grab the message using the map function.
But I am a little confused about what the foreachRDD function does. Does it iterate over all of the RDDs in the lines DStream (which is what I am trying to do)? The thing is, the parse function from the lift-json library only accepts a String, so I need to iterate over all of the RDDs and pass that String value to the parse function, which is what I attempted to do. But nothing is being printed out for some reason.
If you want to read data from a specific offset, you're using the wrong overload.
The one you need is this:
createDirectStream[K, V, KD <: Decoder[K], VD <: Decoder[V], R](
    ssc: StreamingContext,
    kafkaParams: Map[String, String],
    fromOffsets: Map[TopicAndPartition, Long],
    messageHandler: (MessageAndMetadata[K, V]) ⇒ R): InputDStream[R]
You need a Map[TopicAndPartition, Long]:
val offsetMap = Map(TopicAndPartition("topic_test", 0), 8L)
And you need to pass a function which receives a MessageAndMetadata[K, V] and returns your desired type, for example:
val extractKeyValue: MessageAndMetadata[String, String] => (String, String) =
msgAndMeta => (msgAndMeta.key(), msgAndMeta.message())
And use it:
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaBrokers, offsetMap, extractKeyValue)
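To connect this back to the foreachRDD question: foreachRDD runs its body once per batch (every 3 seconds here), handing you the RDD of that batch, so that is where the lift-json parse call belongs. Below is a rough sketch of how the corrected stream could slot into your existing code; it reuses kafkaBrokers, offsetMap and offsetArray from your snippet, assumes net.liftweb.json.parse is imported, and only collects to the driver for demonstration:
val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
  (String, String)](ssc, kafkaBrokers, offsetMap, extractKeyValue)

lines.transform { rdd =>
  // Must be done before any other transformation so the cast still succeeds.
  offsetArray = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.map { case (_, message) =>
  message
}.foreachRDD { rdd =>
  // One RDD per 3-second batch; collect() is only reasonable for small demo volumes.
  rdd.collect().foreach { jsonString =>
    val parsed = parse(jsonString) // lift-json's parse takes a String
    println(parsed)
  }
}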

Spark streaming - transform two streams and join

I've got an issue where I need to transform two streams I am reading from Spark before joining them.
Once I do the transformation, I can no longer join; I guess the type is no longer DStream[(String, String)] but DStream[Map[String, String]].
val windowStream1 = act1Stream.window(Seconds(5)).transform{rdd => rdd.map(_._2).map(l =>(...toMap)}
val windowStream2 = act2Stream.window(Seconds(5)).transform{rdd => rdd.map(_._2).map(l =>(...toMap)}
val joinedWindow = windowStream1.join(windowStream2) //can't join
Any idea?
This doesn't solve your problem, but it makes it more digestible. You can split up the method chain and document which types you expect at each step by defining temporary val/def/var identifiers with the expected type. This way you can easily spot where the type no longer matches your expectations.
E.g. I expect your act1Stream and act2Stream instances to be of type DStream[(String, String)], which I will call s1 and s2 for the moment. Comment if that is not the case.
def joinedWindow(
    s1: DStream[(String, String)],
    s2: DStream[(String, String)]
): DStream[...] = {
  val w1 = windowedStream(s1)
  val w2 = windowedStream(s2)
  w1.join(w2)
}

def windowedStream(actStream: DStream[(String, String)]): DStream[Map[...]] = {
  val windowed: DStream[(String, String)] = actStream.window(Seconds(5))
  windowed.transform( myTransform )
}

def myTransform(rdd: RDD[(String, String)]): RDD[Map[...]] = {
  val mapped: RDD[String] = rdd.map(_._2)
  // not enough information to conclude
  // the result type from given code
  mapped.map(l => (...toMap))
}
From there you can work out the rest of the types by filling in the ... sections, eliminating compiler errors line by line until you get your desired result. The relevant documentation:
DStream[T]
  def window(windowDuration: Duration): DStream[T]
  def transform[U](transformFunc: (RDD[T]) ⇒ RDD[U])(implicit arg0: ClassTag[U]): DStream[U]
PairDStreamFunctions[K, V]
  def join[W](other: DStream[(K, W)])(implicit arg0: ClassTag[W]): DStream[(K, (V, W))]
RDD[T]
  def map[U](f: (T) ⇒ U)(implicit arg0: ClassTag[U]): RDD[U]
At least this way you get to the point where you know exactly where the expected type and the produced type do not match.
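One more detail on the join itself: join comes from PairDStreamFunctions, so it is only available when the DStream elements are (key, value) tuples; a DStream[Map[String, String]] has no key to join on. A hedged sketch of keeping a key next to the parsed map follows, where parseToMap stands in for your elided "...toMap" logic and "joinKey" is an illustrative field name:
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// Placeholder for the conversion elided as "...toMap" in the question.
def parseToMap(line: String): Map[String, String] = ???

// Window, parse, and pair each record with the field the two streams share.
def keyedWindow(stream: DStream[(String, String)]): DStream[(String, Map[String, String])] =
  stream.window(Seconds(5)).transform { rdd =>
    rdd.map(_._2).map { l =>
      val m = parseToMap(l)
      (m.getOrElse("joinKey", ""), m)
    }
  }

// Both sides are now pair DStreams, so join compiles again.
val joinedWindow: DStream[(String, (Map[String, String], Map[String, String]))] =
  keyedWindow(act1Stream).join(keyedWindow(act2Stream))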

Spark createStream error when creating a stream to decode byte arrays in IntelliJ using the Scala plugin

I'm trying to modify the KafkaWordCount Spark Streaming example to take in a byte stream. This is my code so far:
def main(args: Array[String]) {
  if (args.length < 4) {
    System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
    System.exit(1)
  }

  val Array(zkQuorum, group, topics, numThreads) = args
  val sparkConf = new SparkConf().setAppName("SiMod").setMaster("local[2]")
  val ssc = new StreamingContext(sparkConf, Seconds(2))
  ssc.checkpoint("checkpoint")

  var event: Event = null

  val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
  val lines = KafkaUtils.createStream[String, Array[Byte], DefaultDecoder, DefaultDecoder](ssc, zkQuorum, group, topicMap, StorageLevel.MEMORY_ONLY_SER)
The last line -
val lines = KafkaUtils.createStream[String, Array[Byte], DefaultDecoder, DefaultDecoder](ssc, zkQuorum, group, topicMap, StorageLevel.MEMORY_ONLY_SER)
is giving an error in IntelliJ, although as far as I can see my usage is the same as in other examples.
Error:(35, 41) overloaded method value createStream with alternatives:
(jssc: org.apache.spark.streaming.api.java.JavaStreamingContext,keyTypeClass: Class[String],valueTypeClass: Class[Array[Byte]],keyDecoderClass: Class[kafka.serializer.DefaultDecoder],valueDecoderClass: Class[kafka.serializer.DefaultDecoder],kafkaParams: java.util.Map[String,String],topics: java.util.Map[String,Integer],storageLevel: org.apache.spark.storage.StorageLevel)org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream[String,Array[Byte]] <and>
(ssc: org.apache.spark.streaming.StreamingContext,kafkaParams: scala.collection.immutable.Map[String,String],topics: scala.collection.immutable.Map[String,Int],storageLevel: org.apache.spark.storage.StorageLevel)(implicit evidence$1: scala.reflect.ClassTag[String], implicit evidence$2: scala.reflect.ClassTag[Array[Byte]], implicit evidence$3: scala.reflect.ClassTag[kafka.serializer.DefaultDecoder], implicit evidence$4: scala.reflect.ClassTag[kafka.serializer.DefaultDecoder])org.apache.spark.streaming.dstream.ReceiverInputDStream[(String, Array[Byte])]
cannot be applied to (org.apache.spark.streaming.StreamingContext, String, String, scala.collection.immutable.Map[String,Int])
val lines = KafkaUtils.createStream[String, Array[Byte], DefaultDecoder, DefaultDecoder](ssc, zkQuorum, group, topicMap)
What can I do about this?
Try using a StringDecoder for the key instead:
val lines = KafkaUtils.createStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, zkQuorum, group, topicMap, StorageLevel.MEMORY_ONLY_SER)
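If the overload mismatch remains after that, note that the alternatives listed in the compiler error only accept the Kafka settings as kafkaParams: Map[String, String] (there is no explicitly-typed variant taking zkQuorum and group as separate strings), so you may also need to pass them as a map. A minimal sketch, assuming the standard zookeeper.connect and group.id consumer properties:
// Pass the connection settings in the map form that the typed overload expects.
val kafkaParams = Map(
  "zookeeper.connect" -> zkQuorum,
  "group.id" -> group
)

val lines = KafkaUtils.createStream[String, Array[Byte], StringDecoder, DefaultDecoder](
  ssc, kafkaParams, topicMap, StorageLevel.MEMORY_ONLY_SER)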