Lift-Json Extracting from JSON object - scala

I have this code below:
object Test {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Spark").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(3))
    val kafkaBrokers = Map("metadata.broker.list" -> "HostName:9092")
    val offsetMap = Map(TopicAndPartition("topic_test", 0), 8)
    val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaBrokers, offsetMap)
    var offsetArray = Array[OffsetRange]()
    lines.transform { rdd =>
      offsetArray = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }.map {
      _.message()
    }.foreachRDD { rdd =>
      /* NEW CODE */
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
I have added the new code under the comment /* NEW CODE */. My question is this: the lines val will contain a sequence of RDDs, which are basically pulled from the Kafka server every 3 seconds. Then I am grabbing the message using the map function.
But I am a little confused about what the foreachRDD function does. Does it iterate over all of the RDDs in the lines DStream (which is what I am trying to do)? The thing is, the parse function from the lift-json library only accepts a String, so I need to iterate over all of the RDDs and pass each String value to the parse function, which is what I attempted to do. But nothing is being printed out for some reason.

If you want to read data from a specific offset, you're using the wrong overload.
The one you need is this:
createDirectStream[K, V, KD <: Decoder[K], VD <: Decoder[V], R](
    ssc: StreamingContext,
    kafkaParams: Map[String, String],
    fromOffsets: Map[TopicAndPartition, Long],
    messageHandler: (MessageAndMetadata[K, V]) ⇒ R): InputDStream[R]
You need a Map[TopicAndPartition, Long]:
val offsetMap = Map(TopicAndPartition("topic_test", 0) -> 8L)
And you need to pass a function which receives a MessageAndMetadata[K, V] and returns your desired type, for example:
val extractKeyValue: MessageAndMetadata[String, String] => (String, String) =
msgAndMeta => (msgAndMeta.key(), msgAndMeta.message())
And use it:
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaBrokers, offsetMap, extractKeyValue)
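As for the /* NEW CODE */ part: foreachRDD hands you the RDD of each 3-second batch on the driver, and you iterate its records yourself. A minimal sketch of feeding each message to lift-json's parse (assuming every message is a JSON string; collect() is only reasonable for small test batches) could look like this:
import net.liftweb.json._
import org.apache.spark.streaming.dstream.DStream

// `stream` is the DStream[(String, String)] produced by the call above.
def printParsed(stream: DStream[(String, String)]): Unit =
  stream.foreachRDD { rdd =>
    // collect() pulls each batch to the driver so the println output is visible
    // there; use rdd.foreach / rdd.foreachPartition for anything non-trivial
    rdd.collect().foreach { case (_, jsonString) =>
      val parsed = parse(jsonString) // lift-json's parse takes a String and returns a JValue
      println(parsed)
    }
  }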

Related

Transform data using scala in spark

I am trying to transform the input text file into a key/value RDD, but the code below doesn't work. (The text file is a tab-separated file.) I am really new to Scala and Spark, so I would really appreciate your help.
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

object shortTwitter {
  def main(args: Array[String]): Unit = {
    for (line <- Source.fromFile(args(1).txt).getLines()) {
      val newLine = line.map(line =>
        val p = line.split("\t")
        (p(0).toString, p(1).toInt)
      )
    }
    val sparkConf = new SparkConf().setAppName("ShortTwitterAnalysis").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val text = sc.textFile(args(0))
    val counts = text.flatMap(line => line.split("\t"))
  }
}
I'm assuming you want the resulting RDD to have the type RDD[(String, Int)], so:
You should use map (which transforms each record into a single new record) and not flatMap (which transforms each record into multiple records)
You should map the result of the split into a tuple
Altogether:
val counts = text
.map(line => line.split("\t"))
.map(arr => (arr(0), arr(1).toInt))
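For contrast, here is a quick illustration of the difference (the type annotations are my own and need import org.apache.spark.rdd.RDD in scope): flatMap flattens all the split tokens into a single RDD[String], losing the pairing between the two columns, whereas map keeps one array per input line:
// flatMap: every tab-separated token becomes its own record, so the key and the
// value from the same line end up as unrelated elements
val tokens: RDD[String] = text.flatMap(line => line.split("\t"))

// map: one Array[String] per line, so the two columns stay together
val rows: RDD[Array[String]] = text.map(line => line.split("\t"))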
EDIT per clarification in the comments: if you're also interested in fixing the non-Spark part (which reads the file sequentially), you have some errors in the for-comprehension syntax. Here's the entire thing:
def main(args: Array[String]): Unit = {
  // read the file without Spark (not necessary when using Spark):
  val countsWithoutSpark: Iterator[(String, Int)] = for {
    line <- Source.fromFile(args(1)).getLines()
  } yield {
    val p = line.split("\t")
    (p(0), p(1).toInt)
  }

  // equivalent code using Spark:
  val sparkConf = new SparkConf().setAppName("ShortTwitterAnalysis").setMaster("local[2]")
  val sc = new SparkContext(sparkConf)
  val counts: RDD[(String, Int)] = sc.textFile(args(0))
    .map(line => line.split("\t"))
    .map(arr => (arr(0), arr(1).toInt))
}
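A quick way to sanity-check the result (my addition, not part of the original answer) is to print a few records from the driver:
counts.take(5).foreach(println)   // prints a handful of (key, value) tuples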

KafkaUtils API | offset management | Spark Streaming

I am trying to manage Kafka offsets for exactly-once semantics.
I am facing a problem while creating a direct stream using an offset map, as follows:
val fromOffsets : (TopicAndPartition, Long) = TopicAndPartition(metrics_rs.getString(1), metrics_rs.getInt(2)) -> metrics_rs.getLong(3)
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,(String, String)] (ssc,kafkaParams,fromOffsets,messageHandler)
here,
val messageHandler =
(mmd: MessageAndMetadata[String, String]) => mmd.message.length
And
metrics_rs = metricsStatement.executeQuery("SELECT part,off from metrics.txn_offsets where topic='" + t + "'")
I guess I am doing something wrong with the declaration style; I'd appreciate it if you could help.
The compilation error says "too many type arguments for createDirectStream"
A couple of things I see that you're doing wrong.
You need to pass a Map[TopicAndPartition, Long], while currently you have a Tuple2[TopicAndPartition, Long]. So you need:
val fromOffsets: Map[TopicAndPartition, Long] =
Map(TopicAndPartition(metrics_rs.getString(1),
metrics_rs.getInt(2)) -> metrics_rs.getLong(3))
You say your return type from createDirectStream is a tuple of type (String, String), yet your messageHandler returns an Int (the message length). If you want to return a tuple with key-value pairs, you need:
val messageHandler: MessageAndMetadata[String, String] => (String, String) =
(mmd: MessageAndMetadata[String, String]) => (mmd.key(), mmd.message())
After fixing that, this should compile:
val stream = KafkaUtils
.createDirectStream[String, String,
StringDecoder, StringDecoder,
(String, String)] (ssc,
kafkaParams,
fromOffsets,
messageHandler)
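If the metrics.txn_offsets table holds one row per partition, the whole map can be built by iterating the result set. A sketch, assuming metrics_rs is a plain java.sql.ResultSet with the same columns as in the question (1 = topic, 2 = partition, 3 = offset):
import java.sql.ResultSet
import kafka.common.TopicAndPartition

def readOffsets(metrics_rs: ResultSet): Map[TopicAndPartition, Long] = {
  val offsets = scala.collection.mutable.Map.empty[TopicAndPartition, Long]
  while (metrics_rs.next()) {
    // one entry per (topic, partition) row returned by the query
    offsets += TopicAndPartition(metrics_rs.getString(1), metrics_rs.getInt(2)) -> metrics_rs.getLong(3)
  }
  offsets.toMap
}

// val fromOffsets = readOffsets(metrics_rs)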

How to pass hiveContext as argument to functions spark scala

I have created a hiveContext in the main() function in Scala, and I need to pass this hiveContext as a parameter to other functions. This is the structure:
object Project {
  def main(name: String): Int = {
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    ...
  }
  def read(streamId: Int, hc: hiveContext): Array[Byte] = {
    ...
  }
  def close(): Unit = {
    ...
  }
}
But it doesn't work. The read() function is called inside main().
Any idea?
I declare the hiveContext as implicit; this is working for me:
implicit val sqlContext: HiveContext = new HiveContext(sc)
MyJob.run(conf)
Defined in MyJob:
override def run(config: Config)(implicit sqlContext: SQLContext): Unit = ...
But if you don't want it implicit, this should work the same:
val sqlContext: HiveContext = new HiveContext(sc)
MyJob.run(conf)(sqlContext)
override def run(config: Config)(sqlContext: SQLContext): Unit = ...
Also, your read function should take HiveContext (the type), not hiveContext (the value), as the type of the parameter hc:
def read (streamId: Int, hc:HiveContext): Array[Byte] =
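Applied to the structure in the question, a minimal sketch (the SparkContext setup and the body of read are illustrative placeholders, not from the original post):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object Project {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Project").setMaster("local[2]"))
    val hiveContext = new HiveContext(sc)
    val bytes = read(1, hiveContext)   // the context is passed explicitly
    println(bytes.length)
  }

  def read(streamId: Int, hc: HiveContext): Array[Byte] = {
    // placeholder body: use hc here, e.g. hc.sql("SELECT ...")
    Array.empty[Byte]
  }
}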
I tried several options; this is what eventually worked for me:
object SomeName extends App {
  val conf = new SparkConf()...
  val sc = new SparkContext(conf)
  implicit val sqlC = SQLContext.getOrCreate(sc)
  getDF1(sqlC)

  def getDF1(sqlCo: SQLContext): Unit = {
    val query1 = SomeQuery here
    val df1 = sqlCo.read.format("jdbc").options(Map("url" -> dbUrl, "dbtable" -> query1)).load.cache()
    // iterate through df1 and retrieve the 2nd DataFrame based on some values in the Row of the first DataFrame
    df1.foreach(x => {
      getDF2(x.getString(0), x.getDecimal(1).toString, x.getDecimal(3).doubleValue)(sqlCo)
    })
  }

  def getDF2(a: String, b: String, c: Double)(implicit sqlCont: SQLContext): Unit = {
    val query2 = Somequery
    val sqlcc = SQLContext.getOrCreate(sc)
    //val sqlcc = sqlCont //Did not work for me. Also, omitting (implicit sqlCont: SQLContext) altogether did not work
    val df2 = sqlcc.read.format("jdbc").options(Map("url" -> dbURL, "dbtable" -> query2)).load().cache()
    .
    .
    .
  }
}
Note: in the above code, if I omitted the (implicit sqlCont: SQLContext) parameter from the getDF2 method signature, it would not work. I tried several other options for passing the sqlContext from one method to the other; it always gave me a NullPointerException or a Task not serializable exception.
The good thing is that it eventually worked this way, and I could retrieve parameters from a row of the first DataFrame and use those values in loading the second DataFrame.

Spark streaming - transform two streams and join

I've got an issue where I need to transform two streams I am reading in Spark before joining them.
Once I do the transformation, I can no longer join; I guess the type is no longer DStream[(String, String)] but DStream[Map[String, String]]:
val windowStream1 = act1Stream.window(Seconds(5)).transform{rdd => rdd.map(_._2).map(l =>(...toMap)}
val windowStream2 = act2Stream.window(Seconds(5)).transform{rdd => rdd.map(_._2).map(l =>(...toMap)}
val joinedWindow = windowStream1.join(windowStream2) //can't join
Any idea?
This doesn't solve your problem, but it makes it more digestible. You can split up the method chain and document which types you expect at each step by defining temporary val/def/var identifiers with the expected types. That way you can easily spot where the type stops matching your expectations.
E.g. I expect your act1Stream and act2Stream instances to be of type DStream[(String, String)], which I will call s1 and s2 for the moment. Comment if that is not the case.
def joinedWindow(
  s1: DStream[(String, String)],
  s2: DStream[(String, String)]
): DStream[...] = {
  val w1 = windowedStream(s1)
  val w2 = windowedStream(s2)
  w1.join(w2)
}

def windowedStream(actStream: DStream[(String, String)]): DStream[Map[...]] = {
  val windowed: DStream[(String, String)] = actStream.window(Seconds(5))
  windowed.transform(myTransform)
}

def myTransform(rdd: RDD[(String, String)]): RDD[Map[...]] = {
  val mapped: RDD[String] = rdd.map(_._2)
  // not enough information to conclude
  // the result type from given code
  mapped.map(l => (...toMap))
}
From there you can work out the rest of the types by filling in the ... sections, eliminating compiler errors line by line until you get the result you want. The relevant documentation:
DStream[T]
  def window(windowDuration: Duration): DStream[T]
  def transform[U](transformFunc: (RDD[T]) ⇒ RDD[U])(implicit arg0: ClassTag[U]): DStream[U]

PairDStreamFunctions[K, V]
  def join[W](other: DStream[(K, W)])(implicit arg0: ClassTag[W]): DStream[(K, (V, W))]

RDD[T]
  def map[U](f: (T) ⇒ U)(implicit arg0: ClassTag[U]): RDD[U]
At least this way you get to the point where you know exactly where the expected type and the produced type do not match.
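The underlying issue, as the join signature above shows, is that join only exists for key/value streams (DStream[(K, V)] via PairDStreamFunctions), and a DStream[Map[String, String]] no longer has a key. One way out, sketched below with hypothetical extractId and toFields helpers because the original transformation is elided, is to keep a key next to each Map so both windowed streams stay joinable:
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// Hypothetical: derive a join key from each raw value; replace with whatever
// field the real data actually provides.
def extractId(raw: String): String = raw.takeWhile(_ != ',')

// Hypothetical parser standing in for the elided "...toMap" transformation.
def toFields(raw: String): Map[String, String] =
  raw.split(",").map { kv => val Array(k, v) = kv.split("="); k -> v }.toMap

def windowedKeyed(s: DStream[(String, String)]): DStream[(String, Map[String, String])] =
  s.window(Seconds(5)).map { case (_, value) => extractId(value) -> toFields(value) }

def joined(s1: DStream[(String, String)], s2: DStream[(String, String)])
    : DStream[(String, (Map[String, String], Map[String, String]))] =
  windowedKeyed(s1).join(windowedKeyed(s2))  // join is available again: both sides are keyed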

Spark createStream error when creating a stream to decode byte arrays in IntelliJ using the Scala plugin

I'm trying to modify the KafkaWordCount Spark Streaming example to take in a byte stream. This is my code so far:
def main(args: Array[String]) {
  if (args.length < 4) {
    System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
    System.exit(1)
  }

  val Array(zkQuorum, group, topics, numThreads) = args
  val sparkConf = new SparkConf().setAppName("SiMod").setMaster("local[2]")
  val ssc = new StreamingContext(sparkConf, Seconds(2))
  ssc.checkpoint("checkpoint")
  var event: Event = null

  val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
  val lines = KafkaUtils.createStream[String, Array[Byte], DefaultDecoder, DefaultDecoder](ssc, zkQuorum, group, topicMap, StorageLevel.MEMORY_ONLY_SER)
The last line -
val lines = KafkaUtils.createStream[String, Array[Byte], DefaultDecoder, DefaultDecoder](ssc, zkQuorum, group, topicMap, StorageLevel.MEMORY_ONLY_SER)
is giving an error in IntelliJ, although as far as I can see my usage is the same as in other examples.
Error:(35, 41) overloaded method value createStream with alternatives:
(jssc: org.apache.spark.streaming.api.java.JavaStreamingContext,keyTypeClass: Class[String],valueTypeClass: Class[Array[Byte]],keyDecoderClass: Class[kafka.serializer.DefaultDecoder],valueDecoderClass: Class[kafka.serializer.DefaultDecoder],kafkaParams: java.util.Map[String,String],topics: java.util.Map[String,Integer],storageLevel: org.apache.spark.storage.StorageLevel)org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream[String,Array[Byte]] <and>
(ssc: org.apache.spark.streaming.StreamingContext,kafkaParams: scala.collection.immutable.Map[String,String],topics: scala.collection.immutable.Map[String,Int],storageLevel: org.apache.spark.storage.StorageLevel)(implicit evidence$1: scala.reflect.ClassTag[String], implicit evidence$2: scala.reflect.ClassTag[Array[Byte]], implicit evidence$3: scala.reflect.ClassTag[kafka.serializer.DefaultDecoder], implicit evidence$4: scala.reflect.ClassTag[kafka.serializer.DefaultDecoder])org.apache.spark.streaming.dstream.ReceiverInputDStream[(String, Array[Byte])]
cannot be applied to (org.apache.spark.streaming.StreamingContext, String, String, scala.collection.immutable.Map[String,Int])
val lines = KafkaUtils.createStream[String, Array[Byte], DefaultDecoder, DefaultDecoder](ssc, zkQuorum, group, topicMap)
What can I do about this?
Try a StringDecoder for the key instead:
val lines = KafkaUtils.createStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, zkQuorum, group, topicMap, StorageLevel.MEMORY_ONLY_SER)
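One more note (my addition, inferred from the overload list in the compiler error above): the type-parameterized Scala createStream overload takes a kafkaParams: Map[String, String] rather than the separate zkQuorum and group strings, so the call may also need to be shaped like this sketch (same imports as in the question; "zookeeper.connect" and "group.id" are the standard Kafka 0.8 consumer properties):
// Sketch only: build the consumer properties the generic overload expects.
val kafkaParams = Map(
  "zookeeper.connect" -> zkQuorum,
  "group.id" -> group)

val lines = KafkaUtils.createStream[String, Array[Byte], StringDecoder, DefaultDecoder](
  ssc, kafkaParams, topicMap, StorageLevel.MEMORY_ONLY_SER)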