spark streaming update_state_by_keys for arrays aggregation - scala

I have input lines like below
t1, file1, 1, 1, 1
t1, file1, 1, 2, 3
t1, file2, 2, 2, 2, 2
t2, file1, 5, 5, 5
t2, file2, 1, 1, 2, 2
and the output like below rows which is a vertical addition of the corresponding numbers.
file1 : [ 1+, 1+2+5, 1+3+5 ]
file2 : [ 2+1, 2+1, 2+2, 2+2 ]
Currently data aggregation logic is working for batch interval, but it's not maintaining state. So, i am adding update_state_by_key function and passing below function, Is this right way to do?
My current program:
def updateValues( newValues: Seq[Array[Int]], currentValue: Option[Array[Int]]) = {
val previousCount = currentValue.getOrElse(Array.fill[Byte](newValues.length)(0))
val allValues = newValues +: previousCount
Some(allValues.toList.transpose.map(_.sum).toArray)
}
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("HBaseStream")
val sc = new SparkContext(conf)
// create a StreamingContext, the main entry point for all streaming functionality
val ssc = new StreamingContext(sc, Seconds(2))
// parse the lines of data into coverage objects
val inputStream = ssc.socketTextStream(<hostname>, 9999)
ssc.checkpoint("<hostname>:8020/user/spark/checkpoints_dir")
inputStream.print(10)
val parsedDstream = inputStream
.map(line => {
val splitLines = line.split(",")
(splitLines(1), splitLines.slice(2, splitLines.length).map(_.trim.toInt))
})
val aggregated_file_counts = parsedDstream.updateStateByKey(updateValues)
// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()
}
For reference, my previous program (without stateful transformation):
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("HBaseStream")
val sc = new SparkContext(conf)
// create a StreamingContext, the main entry point for all streaming functionality
val ssc = new StreamingContext(sc, Seconds(2))
val inputStream = ssc.socketTextStream("hostname", 9999)
val parsedDstream = inputStream
.map(line => {
val splitLines = line.split(",")
(splitLines(1), splitLines.slice(2, splitLines.length).map(_.trim.toInt))
})
.reduceByKey((first, second) => {
val listOfArrays = ArrayBuffer(first, second)
listOfArrays.toList.transpose.map(_.sum).toArray
})
.foreachRDD(rdd => rdd.foreach(Blaher.blah))
}
Thanks in advance.

What you're looking for is updateStateByKey. For DStream[(T, U)] it should take a function with two arguments:
Seq[U] - representing state for current window
Option[U] - representing accumulated state
and return Option[U].
Given your code it could be implemented for example like this:
import breeze.linalg.{DenseVector => BDV}
import scala.util.Try
val state: DStream[(String, Array[Int])] = parsedStream.updateStateByKey(
(current: Seq[Array[Int]], prev: Option[Array[Int]]) => {
prev.map(_ +: current).orElse(Some(current))
.flatMap(as => Try(as.map(BDV(_)).reduce(_ + _).toArray).toOption)
})
To be able to use it you'll have to configure checkpointing.

Related

Scala/Spark reduce RDD of class objects

I need your help for my last step of a school project.
val conf: SparkConf = new SparkConf() .setMaster("local[*]") .setAppName("AppName") .set("spark.driver.host", "localhost")
val sc: SparkContext = new SparkContext(conf)
var list_creature = new ListBuffer[creature]()
list_creature += new creature("ska")
list_creature(0).addspell("Heal")
list_creature(0).addspell("Attaque")
list_creature += new creature("moise")
list_creature(1).addspell("Tank")
list_creature(1).addspell("Defense")
list_creature(1).addspell("Attaque")
val rdd = sc.parallelize(list_creature)
val y = rdd.map(e=>(e.name,e.Spells)).collect()
val z = y.flatMap(x =>ListBuffer(x._2->x._1))
val ze = z.flatMap(e =>e._1.flatMap(x => ListBuffer(x->e._2)))
i get this as a result,
(Heal,ska)
(Attaque,ska)
(Tank,moise)
(Defense,moise)
(Attaque,moise)
So, i want to reduce this List[List[String]]
to get List[String,List[string]]
and the result will be :
(Heal,(ska))
(Attaque,(ska,moise))
(Tank,(moise))
(Defense,(moise))
Thanks you're the best ...
Not sure why you create a RDD then collect before all the major transformations. Since you didn't provide definition of class Creature, I'm creating a placeholder class based on your question content as follows:
class Creature(val name: String) extends Serializable {
var spells: List[String] = List.empty[String]
def addspell(spell: String): Unit = {
spells ::= spell
}
}
import scala.collection.mutable.ListBuffer
val list_creature = ListBuffer[Creature]()
list_creature += new Creature("ska")
list_creature(0).addspell("Heal")
list_creature(0).addspell("Attaque")
list_creature += new Creature("moise")
list_creature(1).addspell("Tank")
list_creature(1).addspell("Defense")
list_creature(1).addspell("Attaque")
val rdd = sc.parallelize(list_creature)
val reducedRDD = rdd.flatMap( c => c.spells.map(s => (s, List(c.name))) ).
reduceByKey( _ ++ _ )
reducedRDD.collect
// res1: Array[(String, List[String])] = Array(
// (Heal,List(ska)), (Defense,List(moise)), (Attaque,List(ska, moise)), (Tank,List(moise)
// ))

Spark Streaming - Broadcast variable - Case Class

My requirement is to enrich data stream data with profile information from a HBase table. I was looking to use a broadcast variable. Enclosed the whole code here.
The output of HBase data is as follows
In the Driver node HBaseReaderBuilder
(org.apache.spark.SparkContext#3c58b102,hbase_customer_profile,Some(data),WrappedArray(gender, age),None,None,List()))
In the Worker node
HBaseReaderBuilder(null,hbase_customer_profile,Some(data),WrappedArray(gender, age),None,None,List()))
As you can see it has lost the spark context. When i issue the statement val
myRdd = bcdocRdd.map(r => Profile(r._1, r._2, r._3)) i get a NullPointerException
java.lang.NullPointerException
at it.nerdammer.spark.hbase.HBaseReaderBuilderConversions$class.toSimpleHBaseRDD(HBaseReaderBuilder.scala:83)
at it.nerdammer.spark.hbase.package$.toSimpleHBaseRDD(package.scala:5)
at it.nerdammer.spark.hbase.HBaseReaderBuilderConversions$class.toHBaseRDD(HBaseReaderBuilder.scala:67)
at it.nerdammer.spark.hbase.package$.toHBaseRDD(package.scala:5)
at testPartition$$anonfun$main$1$$anonfun$apply$1$$anonfun$apply$2.apply(testPartition.scala:34)
at testPartition$$anonfun$main$1$$anonfun$apply$1$$anonfun$apply$2.apply(testPartition.scala:33)
object testPartition {
def main(args: Array[String]) : Unit = {
val sparkMaster = "spark://x.x.x.x:7077"
val ipaddress = "x.x.x.x:2181" // Zookeeper
val hadoopHome = "/home/hadoop/software/hadoop-2.6.0"
val topicname = "new_events_test_topic"
val mainConf = new SparkConf().setMaster(sparkMaster).setAppName("testingPartition")
val mainSparkContext = new SparkContext(mainConf)
val ssc = new StreamingContext(mainSparkContext, Seconds(30))
val eventsStream = KafkaUtils.createStream(ssc,"x.x.x.x:2181","receive_rest_events",Map(topicname.toString -> 2))
val docRdd = mainSparkContext.hbaseTable[(String, Option[String], Option[String])]("hbase_customer_profile").select("gender","age").inColumnFamily("data")
println ("docRDD from Driver ",docRdd)
val broadcastedprof = mainSparkContext.broadcast(docRdd)
eventsStream.foreachRDD(dstream => {
dstream.foreachPartition(records => {
println("Broadcasted docRDD - in Worker ", broadcastedprof.value)
val bcdocRdd = broadcastedprof.value
records.foreach(record => {
//val myRdd = bcdocRdd.map(r => Profile(r._1, r._2, r._3))
//myRdd.foreach(println)
val Rows = record._2.split("\r\n")
})
})
})
ssc.start()
ssc.awaitTermination()
}
}

Make RDD from LIST[Row] In scala(in spark)

I'm making some code with scala & spark and want to make CSV file from RDD or LIST[Row].
I wanted to process 'ListRDD' data parellel so I thouth output data would be more than one file.
val conf = new SparkConf().setAppName("Csv Application").setMaster("local[2]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val logFile = "data.csv "
val rawdf = sqlContext.read.format("com.databricks.spark.csv")....
val rowRDD = rawdf.map { row =>
Row(
row.getAs( myMap.ID).toString,
row.getAs( myMap.Dept)
.....
)
}
val df = sqlContext.createDataFrame(rowRDD, mySchema)
val MapRDD = df.map { x => (x.getAs[String](myMap.ID), List(x)) }
val ListRDD = MapRDD.reduceByKey { (a: List[Row], b: List[Row]) => List(a, b).flatten }
myClass.myFunction( ListRDD)
in myClass..
def myFunction(ListRDD: RDD[(String, List[Row])]) = {
var rows: RDD[Row]
ListRDD.foreach( row => {
rows.add? gather? ( make(row._2)) // make( row._2) will return List[Row]
})
rows.saveAsFile(" path") // it's my final goal
}
def make( list: List[Row]) : List[Row] = {
data processing from List[Row]
}
I tried to make RDD data from List by sc.parallelize( list) BUT somehow nothing works. anyidea to make RDD type data from make function.
If you want to make an RDD from a List[Row], here is a way to do so
//Assuming list is your List[Row]
val newRDD: RDD[Object] = sc.makeRDD(list.toArray());

Why does mapPartitions print nothing to stdout?

I have this code in scala
object SimpleApp {
def myf(x: Iterator[(String, Int)]): Iterator[(String, Int)] = {
while (x.hasNext) {
println(x.next)
}
x
}
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val tx1 = sc.textFile("/home/paourissi/Desktop/MyProject/data/testfile1.txt")
val file1 = tx1.flatMap(line => line.split(" ")).map(word => (word, 1))
val s = file1.mapPartitions(x => myf(x))
}
}
I am trying to figure out why it doesn't print anything on the output. I run this on a local machine and not on a cluster.
You only have transformations, no actions. Spark will not execute until an action is called. Add this line to print out the top 10 of your results.
s.take(10).foreach(println)
mapPartitions is a transformation, and thus lazy
If you will add an action in the end, the whole expression will be evaluated. Try adding s.count in the end.

convert scala string to RDD[seq[string]]

// 4 workers
val sc = new SparkContext("local[4]", "naivebayes")
// Load documents (one per line).
val documents: RDD[Seq[String]] = sc.textFile("/tmp/test.txt").map(_.split(" ").toSeq)
documents.zipWithIndex.foreach{
case (e, i) =>
val collectedResult = Tokenizer.tokenize(e.mkString)
}
val hashingTF = new HashingTF()
//pass collectedResult instead of document
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
in the above code snippet, i would want to extract collectedResult to reuse it for hashingTF.transform, How can this be achieved where the signature of tokenize function is
def tokenize(content: String): Seq[String] = {
...
}
Looks like you want map rather than foreach. I don't understand what you're using zipWithIndex for, nor why you're calling split on your lines only to join them straight back up again with mkString.
val lines: Rdd[String] = sc.textFile("/tmp/test.txt")
val tokenizedLines = lines.map(tokenize)
val hashes = tokenizedLines.map(hashingTF)
hashes.cache()
...