Counting records of my RDDs in a large Dstream - scala

I am trying to work with a large RDD as read by a file DStream.
The code looks as follows:
val creatingFunc = { () =>
val conf = new SparkConf()
.setMaster("local[10]")
.setAppName("FileStreaming")
.set("spark.streaming.fileStream.minRememberDuration", "2000000h")
.registerKryoClasses(Array(classOf[org.apache.hadoop.io.LongWritable],
classOf[org.apache.hadoop.io.Text], classOf[GGSN]))
val sc = new SparkContext(conf)
// Create a StreamingContext
val ssc = new StreamingContext(sc, Seconds(batchIntervalSeconds))
val appFile = httpFileLines
.map(x=> (x._1,x._2.toString()))
.filter(!_._2.contains("ggsnIPAddress"))
.map(x=>(x._1,x._2.split(",")))
var count=0
appFile.foreachRDD(s => {
// s.collect() throw exception due to insufficient amount of emery
//s.count() throw exception due to insufficient amount of memory
s.foreach(x => count = count + 1)
})
println(count)
newContextCreated = true
ssc
}
what I am trying to do is to get the count of my RDD..however since it is large..it throws exception..so I need to do a foreach instead to avoid collecting data to memory..
I wanna to get the count then as the way in my code but it always gives 0..
Is there a way to do this?

There's no need to foreachRDD and call count. You can use the count method defined on DStream:
val appFile = httpFileLines
.map(x => (x._1, x._2.toString()))
.filter(!_._2.contains("ggsnIPAddress"))
.map(x => (x._1, x._2.split(",")))
val count = appFile.count()
If that still yields an insufficient amount of memory exception, you either need to be calculating smaller batches of data each time, or enlarge you worker nodes to handle the load.

Regarding your solution, you should avoid the collect and sum the count of each RDD of the DStream.
var count=0
appFile.foreachRDD { rdd => {
count = count + rdd.count()
}
}
But I found this solution very ugly (the use of a var in scala).
I prefer the following solution:
val count: Long = errorDStream.count().reduce(_+_)
Notice, that the count method return a DStream of Long and not a Long, this is why you need to use the reduce.

Related

spark 2.x with mapPartitions large number of records parallel processing

I am trying to use spark mapPartitions with Datasets[Spark 2.x] for copying large list of files [1 million records] from one location to another in parallel.
However, at times, I am seeing that one record is getting copied multiple times.
The idea is to split 1 million files into number of partitions (here, 24). Then for each partition, perform copy operation in parallel and finally get result from each partition to perform further actions.
Can someone please tell me what am I doing wrong?
def process(spark: SparkSession): DataFrame = {
import spark.implicits._
//Get source and target List for 1 million records
val sourceAndTargetList =
List(("source1" -> "target1"), ("source 1 Million" -> "Target 1 Million"))
// convert list to dataframe with number of partitions as 24
val SourceTargetDataSet =
sourceAndTargetList.toDF.repartition(24).as[(String, String)]
var dfBuffer = new ListBuffer[DataFrame]()
dfBuffer += SourceTargetDataSet
.mapPartitions(partition => {
println("partition id: " + TaskContext.getPartitionId)
//for each partition
val result = partition
.map(row => {
val source = row._1
val target = row._2
val copyStatus = copyFiles(source, target) // Function to copy files that returns a boolean
val dataframeRow = (target, copyStatus)
dataframeRow
})
.toList
result.toIterator
})
.toDF()
val dfList = dfBuffer.toList
val newDF = dfList.tail.foldLeft(dfList.head)(
(accDF, newDF) => accDF.join(newDF, Seq("_1"))
)
println("newDF Count " + newDF.count)
newDF
}
Update 2: I changed the function as shown below and so far it is giving me consistent results as expected. May I know what I was doing wrong and am I getting the required parallelization using below function? If not, how can this be optimized?
def process(spark: SparkSession): DataFrame = {
import spark.implicits._
//Get source and target List for 1 miilion records
val sourceAndTargetList =
List(("source1" -> "target1"), ("source 1 Million" -> "Target 1 Million"))
// convert list to dataframe with number of partitions as 24
val SourceTargetDataSet =
sourceAndTargetList.toDF.repartition(24).as[(String, String)]
val iterator = SourceTargetDataSet.toDF
.mapPartitions(
(it: Iterator[Row]) =>
it.toList
.map(row => {
println(row)
val source = row.toString.split(",")(0).drop(1)
val target = row.toString.split(",")(1).dropRight(1)
println("source : " + source)
println("target: " + target)
val copyStatus = copyFiles() // Function to copy files that returns a boolean
val dataframeRow = (target, copyStatus)
dataframeRow
})
.iterator
)
.toLocalIterator
val df = y.toList.toDF("targetKey", "copyStatus")
df
}
One should avoid performing write operations in map actions because they can be replayed when an executor dies and the same map has to be performed by another executer.
I'd choose foreach instead.

How to use COGROUP for large datasets

I have two rdd's namely val tab_a: RDD[(String, String)] and val tab_b: RDD[(String, String)] I'm using cogroup for those datasets like:
val tab_c = tab_a.cogroup(tab_b).collect.toArray
val updated = tab_c.map { x =>
{
//somecode
}
}
I'm using tab_c cogrouped values for map function and it works fine for small datasets but in case of huge datasets it throws Out Of Memory exception.
I have tried converting the final value to RDD but no luck same error
val newcos = spark.sparkContext.parallelize(tab_c)
1.How to use Cogroup for large datasets ?
2.Can we persist the cogrouped value ?
Code
val source_primary_key = source.map(rec => (rec.split(",")(0), rec))
source_primary_key.persist(StorageLevel.DISK_ONLY)
val destination_primary_key = destination.map(rec => (rec.split(",")(0), rec))
destination_primary_key.persist(StorageLevel.DISK_ONLY)
val cos = source_primary_key.cogroup(destination_primary_key).repartition(10).collect()
var srcmis: Array[String] = new Array[String](0)
var destmis: Array[String] = new Array[String](0)
var extrainsrc: Array[String] = new Array[String](0)
var extraindest: Array[String] = new Array[String](0)
var srcs: String = Seq("")(0)
var destt: String = Seq("")(0)
val updated = cos.map { x =>
{
val key = x._1
val value = x._2
srcs = value._1.mkString(",")
destt = value._2.mkString(",")
if (srcs.equalsIgnoreCase(destt) == false && destt != "") {
srcmis :+= srcs
destmis :+= destt
}
if (srcs == "") {
extraindest :+= destt.mkString("")
}
if (destt == "") {
extrainsrc :+= srcs.mkString("")
}
}
}
Code Updated:
val tab_c = tab_a.cogroup(tab_b).filter(x => x._2._1 =!= x => x._2._2)
// tab_c = {1,Compactbuffer(1,john,US),Compactbuffer(1,john,UK)}
{2,Compactbuffer(2,john,US),Compactbuffer(2,johnson,UK)}..
ERROR:
ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(4,3,ResultTask,FetchFailed(null,0,-1,27,org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)
ERROR YarnScheduler: Lost executor 8 on datanode1: Container killed by YARN for exceeding memory limits. 1.0 GB of 1020 MB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Thank you
When you use collect() you are basically telling spark to move all the resulting data back to the master node, which can easily produce a bottleneck. You are no longer using Spark at that point, just a plain array in a single machine.
To trigger computation just use something that requires the data at every node, that's why executors live on top of a distributed file system. For instance saveAsTextFile().
Here are some basic examples.
Remember, the entire objective here (that is, if you have big data) is to move the code to your data and compute there, not to bring all the data to the computation.
TL;DR Don't collect.
To run this code safely, without additional assumptions (on average requirements for worker nodes might be significantly smaller), every node (driver and each executor) would require memory significantly exceeding total memory requirements for all data.
If you were to run it outside Spark you would need only one node. Therefore Spark provides no benefits here.
However if you skip collect.toArray and make some assumptions about data distribution you might run it just fine.

[Spark Streaming]How to load the model every time a new message comes in?

In Spark Streaming, every time a new message is received, a model will be used to predict sth based on this new message. But as time goes by, the model can be changed for some reason, so I want to re-load the model whenever a new message comes in. My code looks like this
def loadingModel(#transient sc:SparkContext)={
val model=LogisticRegressionModel.load(sc, "/home/zefu/BIA800/LRModel")
model
}
var error=0.0
var size=0.0
implicit def bool2int(b:Boolean) = if (b) 1 else 0
def updateState(batchTime: Time, key: String, value: Option[String], state: State[Array[Double]]): Option[(String, Double,Double)] = {
val model=loadingModel(sc)
val parts = value.getOrElse("0,0,0,0").split(",").map { _.toDouble }
val pairs = LabeledPoint(parts(0), Vectors.dense(parts.tail))
val prediction = model.predict(pairs.features)
val wrong= prediction != pairs.label
error = state.getOption().getOrElse(Array(0.0,0.0))(0) + 1.0*(wrong:Int)
size=state.getOption().getOrElse(Array(0.0,0.0))(1) + 1.0
val output = (key, error,size)
state.update(Array(error,size))
Some(output)
}
val stateSpec = StateSpec.function(updateState _)
.numPartitions(1)
setupLogging()
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topics = List("test").toSet
val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topics).mapWithState(stateSpec)
When I run this code, there would be an exception like this
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
If you need more information, please let me know.
Thank you!
When a model is used within DStream function, spark seem to serialize the context object (because model's load function uses sc), and it fails because the context object is not serializable. One workaround is to convert DStream to RDD, collect the result and then run model prediction/scoring in the driver.
Used netcat utility to simulate streaming, tried the following code to convert DStream to RDD, it works. See if it helps.
val ssc = new StreamingContext(sc,Seconds(10))
val lines = ssc.socketTextStream("xxx", 9998)
val linedstream = lines.map(lineRDD => Vectors.dense(lineRDD.split(" ").map(_.toDouble)) )
val logisModel = LogisticRegressionModel.load(sc, /path/LR_Model")
linedstream.foreachRDD( rdd => {
for(item <- rdd.collect().toArray) {
val predictedVal = logisModel.predict(item)
println(predictedVal + "|" + item);
}
})
Understand collect is not scalable here, but if you think that your streaming messages are less in number for any interval, this is probably an option. This is what I see it possible in Spark 1.4.0, the higher versions probably have a fix for this. See this if its useful,
Save ML model for future usage

Large task size for simplest program

I am trying to run the simplest program with Spark
import org.apache.spark.{SparkContext, SparkConf}
object LargeTaskTest {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("DataTest").setMaster("local[*]")
val sc = new SparkContext(conf)
val dat = (1 to 10000000).toList
val data = sc.parallelize(dat).cache()
for(i <- 1 to 100){
println(data.reduce(_ + _))
}
}
}
I get the following error message, after each iteration :
WARN TaskSetManager: Stage 0 contains a task of very large size (9767
KB). The maximum recommended task size is 100 KB.
Increasing the data size increases said task size. This suggests to me that the driver is shipping the "dat" object to all executors, but I can't for the life of me see why, as the only operation on my RDD is reduce, which basically has no closure. Any ideas ?
Because you create the very large list locally first, the Spark parallelize method is trying to ship this list to the Spark workers as a single unit, as part of a task. Hence the warning message you receive. As an alternative, you could parallelize a much smaller list, then use flatMap to explode it into the larger list. this also has the benefit of creating the larger set of numbers in parallel. For example:
import org.apache.spark.{SparkContext, SparkConf}
object LargeTaskTest extends App {
val conf = new SparkConf().setAppName("DataTest").setMaster("local[*]")
val sc = new SparkContext(conf)
val dat = (0 to 99).toList
val data = sc.parallelize(dat).cache().flatMap(i => (1 to 1000000).map(j => j * 100 + i))
println(data.count()) //100000000
println(data.reduce(_ + _))
sc.stop()
}
EDIT:
Ultimately the local collection being parallelized has to be pushed to the executors. The parallelize method creates an instance of ParallelCollectionRDD:
def parallelize[T: ClassTag](
seq: Seq[T],
numSlices: Int = defaultParallelism): RDD[T] = withScope {
assertNotStopped()
new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L730
ParallelCollectionRDD creates a number of partitions equal to numSlices:
override def getPartitions: Array[Partition] = {
val slices = ParallelCollectionRDD.slice(data, numSlices).toArray
slices.indices.map(i => new ParallelCollectionPartition(id, i, slices(i))).toArray
}
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L96
numSlices defaults to sc.defaultParallelism which on my machine is 4. So even when split, each partition contains a very large list which needs to be pushed to an executor.
SparkContext.parallelize contains the note #note Parallelize acts lazily and ParallelCollectionRDD contains the comment;
// TODO: Right now, each split sends along its full data, even if
later down the RDD chain it gets // cached. It might be worthwhile
to write the data to a file in the DFS and read it in the split //
instead.
So it appears that the problem happens when you call reduce because this is the point that the partitions are sent to the executors, but the root cause is that you are calling parallelize on a very big list. Generating the large list within the executors is a better approach, IMHO.
Reduce function sends all the data to one single node. When you run sc.parallelize the data is distributed by default to 100 partitions. To make use of the already distributed data use something like this:
data.map(el=> el%100 -> el).reduceByKey(_+_)
or you can do the reduce at partition level.
data.mapPartitions(p => Iterator(p.reduce(_ + _))).reduce(_ + _)
or just use sum :)

Splitting Spark stream by delimiter

I am trying to split my Spark stream based on a delimiter and save each of these chunks to a new file.
Each of my RDDs appear to be partitioned according to the delimiter.
I am having difficulty in configuring one delimiter message per RDD, or, being able to save each partition individually to a new part-000... file .
Any help would be much appreciated. Thanks
val sparkConf = new SparkConf().setAppName("DataSink").setMaster("local[8]").set("spark.files.overwrite","false")
val ssc = new StreamingContext(sparkConf, Seconds(2))
class RouteConsumer extends Actor with ActorHelper with Consumer {
def endpointUri = "rabbitmq://server:5672/myexc?declare=false&queue=in_hl7_q"
def receive = {
case msg: CamelMessage =>
val m = msg.withBodyAs[String]
store(m.body)
}
}
val dstream = ssc.actorStream[String](Props(new RouteConsumer()), "SparkReceiverActor")
val splitStream = dstream.flatMap(_.split("MSH|^~\\&"))
splitStream.foreachRDD( rdd => rdd.saveAsTextFile("file:///home/user/spark/data") )
ssc.start()
ssc.awaitTermination()
You can't control which part-NNNNN (partition) file gets which output, but you can write to different directories. The "easiest" way to do this sort of column splitting is with separate map statements (like SELECT statements), something like this, assuming you'll have n array elements after splitting:
...
val dstream2 = dstream.map(_.split("...")) // like above, but with map
dstream2.cache() // very important for what follows, repeated reads of this...
val dstreams = new Array[DStream[String]](n)
for (i <- 0 to n-1) {
dstreams[i] = dstream2.map(array => array[i] /* or similar */)
dstreams[i].saveAsTextFiles(rootDir+"/"+i)
}
ssc.start()
ssc.awaitTermination()