Splitting Spark stream by delimiter - scala

I am trying to split my Spark stream based on a delimiter and save each of these chunks to a new file.
Each of my RDDs appears to be partitioned according to the delimiter.
I am having difficulty configuring one delimited message per RDD, or saving each partition individually to a new part-000... file.
Any help would be much appreciated. Thanks
val sparkConf = new SparkConf().setAppName("DataSink").setMaster("local[8]").set("spark.files.overwrite", "false")
val ssc = new StreamingContext(sparkConf, Seconds(2))

class RouteConsumer extends Actor with ActorHelper with Consumer {
  def endpointUri = "rabbitmq://server:5672/myexc?declare=false&queue=in_hl7_q"
  def receive = {
    case msg: CamelMessage =>
      val m = msg.withBodyAs[String]
      store(m.body)
  }
}

val dstream = ssc.actorStream[String](Props(new RouteConsumer()), "SparkReceiverActor")
val splitStream = dstream.flatMap(_.split("MSH|^~\\&"))
splitStream.foreachRDD( rdd => rdd.saveAsTextFile("file:///home/user/spark/data") )
ssc.start()
ssc.awaitTermination()

You can't control which part-NNNNN (partition) file gets which output, but you can write to different directories. The "easiest" way to do this sort of column splitting is with separate map statements (like SELECT statements), something like this, assuming you'll have n array elements after splitting:
...
val dstream2 = dstream.map(_.split("...")) // like above, but with map instead of flatMap
dstream2.cache() // very important for what follows: repeated reads of this stream
val dstreams = new Array[DStream[String]](n)
for (i <- 0 until n) {
  dstreams(i) = dstream2.map(array => array(i) /* or similar */)
  dstreams(i).saveAsTextFiles(rootDir + "/" + i)
}
ssc.start()
ssc.awaitTermination()
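One detail worth flagging about the question's split (my own observation, not part of the answer above): String.split treats its argument as a regular expression, and the HL7 header "MSH|^~\&" contains regex metacharacters such as | and ^, so the stream is probably not being split on the literal delimiter. A minimal sketch of splitting on the quoted literal:

import java.util.regex.Pattern

// Quote the delimiter so | and ^ are matched literally rather than as regex operators.
val delimiter = Pattern.quote("MSH|^~\\&")
val splitStream = dstream.flatMap(_.split(delimiter).filter(_.nonEmpty))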

Related

[Spark Streaming] How to load the model every time a new message comes in?

In Spark Streaming, every time a new message is received, a model is used to predict something based on that message. But as time goes by, the model can change for some reason, so I want to re-load the model whenever a new message comes in. My code looks like this:
def loadingModel(@transient sc: SparkContext) = {
  val model = LogisticRegressionModel.load(sc, "/home/zefu/BIA800/LRModel")
  model
}

var error = 0.0
var size = 0.0
implicit def bool2int(b: Boolean) = if (b) 1 else 0

def updateState(batchTime: Time, key: String, value: Option[String], state: State[Array[Double]]): Option[(String, Double, Double)] = {
  val model = loadingModel(sc)
  val parts = value.getOrElse("0,0,0,0").split(",").map { _.toDouble }
  val pairs = LabeledPoint(parts(0), Vectors.dense(parts.tail))
  val prediction = model.predict(pairs.features)
  val wrong = prediction != pairs.label
  error = state.getOption().getOrElse(Array(0.0, 0.0))(0) + 1.0 * (wrong: Int)
  size = state.getOption().getOrElse(Array(0.0, 0.0))(1) + 1.0
  val output = (key, error, size)
  state.update(Array(error, size))
  Some(output)
}

val stateSpec = StateSpec.function(updateState _)
  .numPartitions(1)

setupLogging()

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topics = List("test").toSet
val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics).mapWithState(stateSpec)
When I run this code, I get an exception like this:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
If you need more information, please let me know.
Thank you!
When a model is used within a DStream function, Spark seems to serialize the SparkContext (because the model's load function uses sc), and it fails because the context object is not serializable. One workaround is to convert the DStream to RDDs, collect the results, and then run the model prediction/scoring in the driver.
I used the netcat utility to simulate streaming and tried the following code to convert the DStream to RDDs; it works. See if it helps.
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("xxx", 9998)
val linedstream = lines.map(lineRDD => Vectors.dense(lineRDD.split(" ").map(_.toDouble)))
val logisModel = LogisticRegressionModel.load(sc, "/path/LR_Model")
linedstream.foreachRDD( rdd => {
  for (item <- rdd.collect().toArray) {
    val predictedVal = logisModel.predict(item)
    println(predictedVal + "|" + item)
  }
})
I understand that collect is not scalable here, but if your streaming messages are small in number for any interval, this is probably an option. This is what I see as possible in Spark 1.4.0; the higher versions probably have a fix for this. See this if it's useful:
Save ML model for future usage
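As a side note on the original goal of picking up a changed model (this is my own sketch, not part of the answer above): since foreachRDD runs on the driver, the model can be re-loaded once per batch rather than once per message, so an updated model file is used at the next interval. The path is a placeholder.

linedstream.foreachRDD { rdd =>
  // Runs on the driver once per batch interval, so no SparkContext ends up in a task closure.
  val freshModel = LogisticRegressionModel.load(rdd.sparkContext, "/path/LR_Model")
  for (item <- rdd.collect()) {
    println(freshModel.predict(item) + "|" + item)
  }
}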

spark job freeze when started in ParArray

I want to convert a set of time-series data from multiple CSV files to LabeledPoint and save it to Parquet files. The CSV files are small, usually < 10 MiB.
When I start it with ParArray, it submits 4 jobs at a time and freezes. Code here:
val idx = Another_DataFrame
ListFiles(new File("data/stock data"))
  .filter(_.getName.contains(".csv")).zipWithIndex
  .par // comment this line and the code runs smoothly
  .foreach { f =>
    val stk = spark_csv(f._1.getPath) // doing good
    ColMerge(stk, idx, RESULT_PATH(f)) // freezes here
    stk.unpersist()
  }
and the freeze part:
def ColMerge(ori: DataFrame, index: DataFrame, PATH: String) = {
  val df = ori.join(index, ori("date") === index("index_date")).drop("index_date").orderBy("date").cache
  val head = df.head
  val col = df.columns.filter(e => e != "code" && e != "date" && e != "name")
  val toMap = col.filter { e =>
    head.get(head.fieldIndex(e)).isInstanceOf[String]
  }.sorted
  val toCast = col.diff(toMap).filterNot(_ == "data")
  val res: Array[((String, String, Array[Double]), Long)] = df.sort("date").map { row =>
    val res1 = toCast.map { col =>
      row.getDouble(row.fieldIndex(col))
    }
    val res2 = toMap.flatMap { col =>
      val mapping = new Array[Double](GlobalConfig.ColumnMapping(col).size)
      row.getString(row.fieldIndex(col)).split(";").par.foreach { word =>
        mapping(GlobalConfig.ColumnMapping(col)(word)) = 1
      }
      mapping
    }
    (
      row.getString(row.fieldIndex("code")),
      row.getString(row.fieldIndex("date")),
      res1 ++ res2 ++ row.getAs[Seq[Double]]("data")
    )
  }.zipWithIndex.collect
  df.unpersist
  val dataset = GlobalConfig.sctx.makeRDD(res.map { day =>
    (day._1._1,
     day._1._2,
     try {
       new LabeledPoint(
         GetHighPrice(res(day._2.toInt + 2)._1._3.slice(0, 4)) / GetLowPrice(res(day._2.toInt)._1._3.slice(0, 4)) * 1.03,
         Vectors.dense(day._1._3))
     } catch {
       case ex: ArrayIndexOutOfBoundsException =>
         new LabeledPoint(-1, Vectors.dense(day._1._3))
     }
    )
  }).filter(_._3.label != -1).toDF("code", "date", "labeledpoint")
  dataset.write.mode(SaveMode.Overwrite).parquet(PATH)
}
The exact job that freezes is the DataFrame.sort() or the zipWithIndex when generating res in ColMerge.
Since most of the work is done after collect, I really want to use ParArray to accelerate ColMerge, but this weird freeze stops me from doing so. Do I need to create a new thread pool to do this?
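On the thread-pool question, a minimal sketch of the standard Scala mechanism (my own addition, not a confirmed fix for the freeze): a parallel collection's concurrency is controlled through its tasksupport, so a dedicated pool with a bounded size can be attached like this (the pool size of 2 is arbitrary):

import scala.collection.parallel.ForkJoinTaskSupport

val files = ListFiles(new File("data/stock data"))
  .filter(_.getName.contains(".csv")).zipWithIndex.par

// Limit how many ColMerge calls run concurrently. On Scala 2.10/2.11 the pool class is
// scala.concurrent.forkjoin.ForkJoinPool; on 2.12+ use java.util.concurrent.ForkJoinPool instead.
files.tasksupport = new ForkJoinTaskSupport(new scala.concurrent.forkjoin.ForkJoinPool(2))

files.foreach { f =>
  val stk = spark_csv(f._1.getPath)
  ColMerge(stk, idx, RESULT_PATH(f))
  stk.unpersist()
}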

Spark Code Optimization

My task is to write code that reads a big file (one that doesn't fit into memory), reverses it, and outputs the five most frequent words.
I have written the code below and it does the job.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object ReverseFile {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Reverse File")
    conf.set("spark.hadoop.validateOutputSpecs", "false")
    val sc = new SparkContext(conf)
    val txtFile = "path/README_mid.md"
    val txtData = sc.textFile(txtFile)
    txtData.cache()

    val tmp = txtData.map(l => l.reverse).zipWithIndex().map { case (x, y) => (y, x) }.sortByKey(ascending = false).map { case (u, v) => v }
    tmp.coalesce(1, true).saveAsTextFile("path/out.md")

    val txtOut = "path/out.md"
    val txtOutData = sc.textFile(txtOut)
    txtOutData.cache()

    val wcData = txtOutData.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).map(item => item.swap).sortByKey(ascending = false)
    wcData.collect().take(5).foreach(println)
  }
}
The problem is that I'm new to Spark and Scala, and as you can see in the code, I first read the file, reverse it, and save it, then read the reversed output and print the five most frequent words.
Is there a way to tell Spark to save tmp and process wcData at the same time (without having to save and reopen the file)? Otherwise it's like reading the file twice.
From now on I'm going to work with Spark a lot, so if there is any part of the code (the Spark-specific parts, not things like the absolute path name ...) that you think could be written better, I'd appreciate it.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object ReverseFile {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Reverse File")
    conf.set("spark.hadoop.validateOutputSpecs", "false")
    val sc = new SparkContext(conf)
    val txtFile = "path/README_mid.md"
    val txtData = sc.textFile(txtFile)
    txtData.cache()

    val reversed = txtData
      .zipWithIndex()
      .map(_.swap)
      .sortByKey(ascending = false)
      .map(_._2) // No need to deconstruct the tuple.

    // No need for the coalesce, Spark should do that by itself.
    reversed.saveAsTextFile("path/reversed.md")

    // Reuse txtData here.
    val wcData = txtData
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .map(_.swap)
      .sortByKey(ascending = false)

    wcData
      .take(5) // take already collects to the driver.
      .foreach(println)
  }
}
Always do the collect() last, so Spark can evaluate as much as possible on the cluster.
The most expensive part of your code is the sorting, so the obvious improvement is to remove it. It is relatively simple in the second case, where the full sort is completely unnecessary:
val wcData = txtData
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _) // No need to swap or sort

// Use the top method with an explicit ordering in place of swap / sortByKey
val top5 = wcData.top(5)(scala.math.Ordering.by[(String, Int), Int](_._2))
Reversing the order of lines is a little bit trickier. First let's reorder the elements within each partition:
val reversedPartitions = txtData.mapPartitions(_.toList.reverse.toIterator)
Now you have two options.
Option 1: use a custom partitioner:
class ReversePartitioner(n: Int) extends Partitioner {
  def numPartitions: Int = n
  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[Int]
    numPartitions - 1 - k
  }
}

val partitioner = new ReversePartitioner(reversedPartitions.partitions.size)

val reversed = reversedPartitions
  // Add current partition number
  .mapPartitionsWithIndex((i, iter) => Iterator((i, iter.toList)))
  // Repartition to get reversed order
  .partitionBy(partitioner)
  // Drop partition numbers
  .values
  // Reshape
  .flatMap(identity)
It still requires a shuffle, but it is relatively portable and the data remains accessible in memory.
Option 2: if all you want is to save the reversed data, you can call saveAsTextFile on reversedPartitions and reorder the output files logically. Since the part-n name format identifies the source partition, all you have to do is rename part-n to part-(number-of-partitions - 1 - n). This requires saving the data, so it is not exactly optimal, but if you use, for example, an in-memory file system it can be a pretty good solution.
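A minimal sketch of that renaming step, assuming the output lives on a Hadoop-compatible file system (the output directory name and the two-pass rename through a temporary prefix are my own choices, not part of the answer):

import org.apache.hadoop.fs.{FileSystem, Path}

val outDir = "reversed_output" // hypothetical output directory
reversedPartitions.saveAsTextFile(outDir)

val fs = FileSystem.get(sc.hadoopConfiguration)
val parts = fs.globStatus(new Path(outDir + "/part-*")).map(_.getPath)
val n = parts.length

// First pass: move part-i to a temporary name with the reversed index,
// so renames cannot collide with part files that have not moved yet.
parts.foreach { p =>
  val i = p.getName.stripPrefix("part-").toInt
  fs.rename(p, new Path(outDir + "/tmp-%05d".format(n - 1 - i)))
}
// Second pass: restore the part- prefix.
fs.globStatus(new Path(outDir + "/tmp-*")).map(_.getPath).foreach { p =>
  fs.rename(p, new Path(outDir + "/" + p.getName.replace("tmp-", "part-")))
}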

working with two RDDs apache spark

I am using Calliope, i.e. a Spark plugin to connect with Cassandra. I have created 2 RDDs which look like this:
class A
val persistLevel = org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK
val cas1 = CasBuilder.cql3.withColumnFamily("cassandra_keyspace", "cassandra_coulmn_family 1")
val sc1 = new SparkContext("local", "name it any thing ")
var rdd1 = sc.cql3Cassandra[SCALACLASS_1](cas1)
var rddResult1 = rdd1.persist(persistLevel)
class B
val cas2 = CasBuilder.cql3.withColumnFamily("cassandra_keyspace", "cassandra_coulmn_family 2")
var rdd2 = sc1.cql3Cassandra[SCALACLASS_2](cas2)
var rddResult2 = rdd2.persist(persistLevel)
Somehow the following code, which creates a new RDD using the other two, is not working. Is it possible that we cannot iterate over two RDDs together?
Here is the code snippet which is not working:
case class Report(id: Long, anotherId: Long)

var reportRDD = rddResult2.flatMap(f => {
  val buf = List[Report]()
  **rddResult1.collect().toList**.foldLeft(buf)((k, v) => {
    val buf1 = new ListBuffer[Report]
    buf ++ v.INSTANCE_VAR_FROM_SCALACLASS_1.foldLeft(buf1)((ik, iv) => {
      buf1 += Report(f.INSTANCE_VAR_FROM_SCALACLASS_1, iv.INSTANCE_VAR_FROM_SCALACLASS_2)
    })
  })
})
Whereas if I replace the bold part and initialize a val for it like this:
val collection = rddResult1.collect().toList

var reportRDD = rddResult2.flatMap(f => {
  val buf = List[Report]()
  **collection**.foldLeft(buf)((k, v) => {
    val buf1 = new ListBuffer[Report]
    buf ++ v.INSTANCE_VAR_FROM_SCALACLASS_1.foldLeft(buf1)((ik, iv) => {
      buf1 += Report(f.INSTANCE_VAR_FROM_SCALACLASS_1, iv.INSTANCE_VAR_FROM_SCALACLASS_2)
    })
  })
})
it works. Is there any explanation?
You are mixing a transformation with an action. The closure of rdd2.flatMap is executed on the workers, while rdd1.collect is an 'action' in Spark lingo that delivers the data back to the driver. So, informally, you could say that the data is not there when you try to flatMap over it. (I don't know enough of the internals yet to pinpoint the exact root cause.)
If you want to operate on both RDDs distributedly, you should join them using one of the join functions (join, leftOuterJoin, rightOuterJoin, cogroup).
E.g.
val mappedRdd1 = rdd1.map(x=> (x.id,x))
val mappedRdd2 = rdd2.map(x=> (x.customerId, x))
val joined = mappedRdd1.join(mappedRdd2)
joined.flatMap(...reporting logic..).collect
You can operate on RDDs in the application. But you cannot operate on RDDs in the executors (the worker nodes). The executors cannot give commands to drive the cluster. The code inside flatMap runs on the executors.
In the first case, you try to operate on an RDD in the executor. I reckon you would get a NotSerializableException as you cannot even send the RDD object to the executors. In the second case, you pull the RDD contents to the application, and then send this simple List to the executors. (Lambda captures are automatically serialized.)
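As an aside (my own addition, not part of either answer): if the collected list is large or reused by many tasks, a broadcast variable ships it to each executor once instead of embedding it in every task closure. A minimal sketch, reusing the placeholder names from the question:

val collection = rddResult1.collect().toList
val collectionBc = sc1.broadcast(collection) // sent to each executor once

val reportRDD = rddResult2.flatMap { f =>
  collectionBc.value.flatMap { v =>
    v.INSTANCE_VAR_FROM_SCALACLASS_1.map { iv =>
      Report(f.INSTANCE_VAR_FROM_SCALACLASS_1, iv.INSTANCE_VAR_FROM_SCALACLASS_2)
    }
  }
}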

Spark job not parallelising locally (using Parquet + Avro from local filesystem)

Edit 2: I indirectly solved the problem by repartitioning the RDD into 8 partitions. I hit a roadblock with Avro objects not being "java serialisable"; I found a snippet here to delegate Avro serialisation to Kryo. The original problem still remains.
Edit 1: Removed the local variable reference in the map function.
I'm writing a driver to run a compute-heavy job on Spark, using Parquet and Avro for IO/schema. I can't seem to get Spark to use all my cores. What am I doing wrong? Is it because I have set the keys to null?
I am just getting my head around how Hadoop organises files. AFAIK, since my file has a gigabyte of raw data, I should expect to see things parallelise with the default block and page sizes.
The function to ETL my input for processing looks as follows:
def genForum {
  class MyWriter extends AvroParquetWriter[Topic](new Path("posts.parq"), Topic.getClassSchema) {
    override def write(t: Topic) {
      synchronized {
        super.write(t)
      }
    }
  }

  def makeTopic(x: ForumTopic): Topic = {
    // Omitted to save space
  }

  val writer = new MyWriter
  val q =
    DBCrawler.db.withSession {
      Query(ForumTopics).filter(x => x.crawlState === TopicCrawlState.Done).list()
    }
  val sz = q.size
  val c = new AtomicInteger(0)
  q.par.foreach { x =>
    writer.write(makeTopic(x))
    val count = c.incrementAndGet()
    print(f"\r${count.toFloat * 100 / sz}%4.2f%%")
  }
  writer.close()
}
And my transformation looks as follows:
def sparkNLPTransformation() {
  val sc = new SparkContext("local[8]", "forumAddNlp")

  // io configuration
  val job = new Job()
  ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[Topic]])
  ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
  AvroParquetOutputFormat.setSchema(job, Topic.getClassSchema)

  // configure annotator
  val props = new Properties()
  props.put("annotators", "tokenize,ssplit,pos,lemma,parse")
  val an = DAnnotator(props)

  // annotator function
  def annotatePosts(ann: DAnnotator, top: Topic): Topic = {
    val new_p = top.getPosts.map { x =>
      val at = new Annotation(x.getPostText.toString)
      ann.annotator.annotate(at)
      val t = at.get(classOf[SentencesAnnotation]).map(_.get(classOf[TreeAnnotation])).toList
      val r = SpecificData.get().deepCopy[Post](x.getSchema, x)
      if (t.nonEmpty) r.setTrees(t)
      r
    }
    val new_t = SpecificData.get().deepCopy[Topic](top.getSchema, top)
    new_t.setPosts(new_p)
    new_t
  }

  // transformation
  val ds = sc.newAPIHadoopFile("forum_dataset.parq", classOf[ParquetInputFormat[Topic]], classOf[Void], classOf[Topic], job.getConfiguration)
  val new_ds = ds.map(x => (null, annotatePosts(an, x._2))) // pass the annotator explicitly to match the signature above
  new_ds.saveAsNewAPIHadoopFile("annotated_posts.parq",
    classOf[Void],
    classOf[Topic],
    classOf[ParquetOutputFormat[Topic]],
    job.getConfiguration
  )
}
Can you confirm that the data is indeed stored in multiple blocks in HDFS? The total block count on the forum_dataset.parq file determines how many input splits, and therefore how many parallel tasks, Spark can create for it.
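For reference (my own sketch, not part of the question or the comment above): a quick way to check how many partitions the Hadoop input actually produced, and to force the parallelism mentioned in edit 2, assuming the ds RDD from the transformation above:

// Number of partitions Spark derived from the input; with a single file on the
// local filesystem this is often just 1, which serialises the whole map stage.
println("input partitions: " + ds.partitions.size)

// Force the parallelism used in edit 2 (8 is the asker's chosen partition count).
val parallelDs = ds.repartition(8)
val new_ds = parallelDs.map(x => (null, annotatePosts(an, x._2)))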