org.apache.spark.SparkException: Task not serializable (Scala)

I am new to Scala as well as Spark. Please help me resolve this issue.
When I load the functions below individually in the spark-shell, they run without any exception. But when I put the same functions inside a Scala object and load that file in the spark-shell, they throw a "Task not serializable" exception in the "processbatch" function when it tries to parallelize.
Please find the code below:
import org.apache.spark.sql.Row
import org.apache.log4j.Logger
import org.apache.spark.sql.hive.HiveContext

object Process {
  val hc = new HiveContext(sc)

  def processsingle(wait: Int, patient: Row, visits: Array[Row]): String = {
    val out = new StringBuilder()
    for (x <- visits) {
      out.append(", " + x.getAs("patientid") + ":" + x.getAs("visitid"))
    }
    out.toString
  }

  def processbatch(batch: Int, wait: Int, patients: Array[Row], visits: Array[Row]) = {
    val out = sc.parallelize(patients, batch)
      .map(r => processsingle(wait, r, visits.filter(f => f.getAs("patientid") == r.getAs("patientid"))))
      .collect()
    for (x <- out) println(x)
  }

  def processmeasures(fetch: Int, batch: Int, wait: Int) = {
    val processStart = getTimeInMillis()
    val patients = hc.sql("SELECT patientid FROM tableName1 order by p_id").collect()
    val visit = hc.sql("SELECT patientid, visitid FROM tableName2")
    val count = patients.length
    val fetches = if (count % fetch > 0) (count / fetch + 1) else (count / fetch)
    for (i <- 0 to fetches.toInt - 1) {
      val startFetch = i * fetch
      val endFetch = math.min((i + 1) * fetch, count.toInt) - 1
      val fetchSize = endFetch - startFetch + 1
      val fetchClause = "patientid >= " + patients(startFetch).get(0) + " and patientid <= " + patients(endFetch).get(0)
      val fetchVisit = visit.filter(fetchClause).collect()
      val batches = if (fetchSize % batch > 0) (fetchSize / batch + 1) else (fetchSize / batch)
      for (j <- 0 to batches.toInt - 1) {
        val startBatch = j * batch
        val endBatch = math.min((j + 1) * batch, fetch.toInt) - 1
        println(s"Batch from $startBatch to $endBatch")
        val batchVisits = fetchVisit.filter(g => g.getAs[Long]("patientid") >= patients(i * fetch + startBatch).getLong(0) && g.getAs[Long]("patientid") <= patients(math.min(i * fetch + endBatch + 1, endFetch)).getLong(0))
        processbatch(batch, wait, patients.slice(i * fetch + startBatch, i * fetch + endBatch + 1), batchVisits)
      }
    }
    println("Processing took " + getExecutionTime(processStart) + " millis")
  }
}

You should make the Process object Serializable:
object Process extends Serializable {
...
}
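The closure passed to map in processbatch calls processsingle, which is a method of the Process object, so Spark has to serialize the whole object; extending Serializable makes that possible. If some members of the object are not serializable themselves (the HiveContext, for example), a common pattern is to mark them @transient so they are kept out of the serialized closure. A minimal sketch, assuming the file is loaded in the spark-shell (where sc is in scope) and the HiveContext is only used on the driver:
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext

object Process extends Serializable {
  // @transient keeps the non-serializable HiveContext out of the closure;
  // it is only used on the driver, never inside map/filter functions.
  @transient lazy val hc: HiveContext = new HiveContext(sc)

  def processsingle(wait: Int, patient: Row, visits: Array[Row]): String = {
    val out = new StringBuilder()
    for (x <- visits) out.append(", " + x.getAs[Any]("patientid") + ":" + x.getAs[Any]("visitid"))
    out.toString
  }

  // processbatch and processmeasures unchanged from the question
}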

Related

Bubble sort of random integers in scala

I'm new to the Scala programming language. In this bubble sort I need to generate 10 random integers instead of writing them down by hand like in the code below.
Any suggestions?
object BubbleSort {
  def bubbleSort(array: Array[Int]) = {
    def bubbleSortRecursive(array: Array[Int], current: Int, to: Int): Array[Int] = {
      println(array.mkString(",") + " current -> " + current + ", to -> " + to)
      to match {
        case 0 => array
        case _ if (to == current) => bubbleSortRecursive(array, 0, to - 1)
        case _ =>
          if (array(current) > array(current + 1)) {
            val temp = array(current + 1)
            array(current + 1) = array(current)
            array(current) = temp
          }
          bubbleSortRecursive(array, current + 1, to)
      }
    }
    bubbleSortRecursive(array, 0, array.size - 1)
  }

  def main(args: Array[String]) {
    val sortedArray = bubbleSort(Array(10, 9, 11, 5, 2))
    println("Sorted Array -> " + sortedArray.mkString(","))
  }
}
Try this:
import scala.util.Random
val sortedArray = (1 to 10).map(_ => Random.nextInt).toArray
You can use scala.util.Random for the generation. The nextInt method takes a maxValue argument, so with the sample below you'll generate a list of 10 Int values from 0 (inclusive) to 100 (exclusive).
val r = scala.util.Random
for (i <- 1 to 10) yield r.nextInt(100)
You can find more info here or here
You can use it this way.
val solv1 = Random.shuffle( (1 to 100).toList).take(10)
val solv2 = Array.fill(10)(Random.nextInt)
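For completeness, a minimal sketch (assuming the bubbleSort method from the question is in scope) of a main that feeds 10 random integers, with an upper bound of 100 as in the examples above, into the existing sort:
import scala.util.Random

def main(args: Array[String]): Unit = {
  // 10 random Ints in [0, 100) instead of a hard-coded array
  val randomArray = Array.fill(10)(Random.nextInt(100))
  val sortedArray = bubbleSort(randomArray)
  println("Sorted Array -> " + sortedArray.mkString(","))
}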

Scala - append RDD to itself

for (fordate <- 2 to 30) {
  val dataRDD = sc.textFile("s3n://mypath" + fordate + "/*")
  val a = 1
  val c = fordate - 1
  for (b <- a to c) {
    val cumilativeRDD1 = sc.textFile("s3n://mypath/" + b + "/*")
    val cumilativeRDD: org.apache.spark.rdd.RDD[String] = sc.union(cumilativeRDD1, cumilativeRDD)
    if (b == c) {
      val incrementalDEviceIDs = dataRDD.subtract(cumilativeRDD)
      val countofIDs = incrementalDEviceIDs.distinct().count()
      println(s"201611 $fordate $countofIDs")
    }
  }
}
I have a data set where I get device IDs on a daily basis. I need to figure out the incremental count per day, but when I join cumilativeRDD to itself it throws the following error:
forward reference extends over definition of value cumilativeRDD
How can I overcome this?
The problem is this line:
val cumilativeRDD: org.apache.spark.rdd.RDD[String] = sc.union(cumilativeRDD1, cumilativeRDD)
You're using cumilativeRDD before its declaration. Assignment works from right to left: the right side of = defines the variable on the left, so you cannot use the variable inside its own definition, because on the right side of the equation the variable does not exist yet.
You have to initialize cumilativeRDD in the first run, and then you can use it in the following runs:
var cumilativeRDD: Option[org.apache.spark.rdd.RDD[String]] = None

for (fordate <- 2 to 30) {
  val dataRDD = sc.textFile("s3n://mypath" + fordate + "/*")
  val c = fordate - 1
  for (b <- 1 to c) {
    val cumilativeRDD1 = sc.textFile("s3n://mypath/" + b + "/*")
    if (cumilativeRDD.isEmpty) cumilativeRDD = Some(cumilativeRDD1)
    else cumilativeRDD = Some(sc.union(cumilativeRDD1, cumilativeRDD.get))
    if (b == c) {
      val incrementalDEviceIDs = dataRDD.subtract(cumilativeRDD.get)
      val countofIDs = incrementalDEviceIDs.distinct().count()
      println("201611" + fordate + " " + countofIDs)
    }
  }
}
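As an alternative sketch (assuming all days follow the same s3n path layout as in the question), you can build the RDDs for the previous days up front and union them in a single call, which avoids referring to cumilativeRDD inside its own definition altogether:
for (fordate <- 2 to 30) {
  val dataRDD = sc.textFile("s3n://mypath" + fordate + "/*")
  // union all previous days in one call instead of accumulating in an inner loop
  val previousDays = (1 to fordate - 1).map(b => sc.textFile("s3n://mypath/" + b + "/*"))
  val cumilativeRDD: org.apache.spark.rdd.RDD[String] = sc.union(previousDays)
  val countofIDs = dataRDD.subtract(cumilativeRDD).distinct().count()
  println("201611" + fordate + " " + countofIDs)
}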

Sorting a DStream and taking topN

I have a DStream in Spark (Scala) and I want to sort it and then take the top N.
The problem is that whenever I try to run it I get a NotSerializableException, and the exception message says:
This is because the DStream object is being referred to from within the closure.
I don't know how to solve it.
Here is my attempt:
package com.badrit.realtime

import java.util.Date

import com.badrit.drivers.UnlimitedSpaceTimeDriver
import com.badrit.model.{CellBuilder, DataReader, Trip}
import com.badrit.utility.Printer
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.{Duration, Milliseconds, StreamingContext}

import scala.collection.mutable

object StreamingDriver {

  val appName: String = "HotSpotRealTime"
  val hostName = "localhost"
  val port = 5050
  val constrains = UnlimitedSpaceTimeDriver.constrains
  var streamingRate = 1
  var windowSize = 8
  var slidingInterval = 2
  val cellBuilder = new CellBuilder(constrains)
  val inputFilePath = "/home/ahmedelgamal/Downloads/green_tripdata_2015-02.csv"

  def prepareTestData(sparkStreamCtx: StreamingContext): InputDStream[Trip] = {
    val sparkCtx = sparkStreamCtx.sparkContext
    val textFile: RDD[String] = sparkCtx.textFile(inputFilePath)
    val data: RDD[Trip] = new DataReader().getTrips(textFile)
    val groupedData = data.filter(_.pickup.date.before(new Date(2015, 1, 2, 0, 0, 0)))
      .groupBy(trip => trip.pickup.date.getMinutes).sortBy(_._1).map(_._2).collect()
    printf("Grouped Data Count is " + groupedData.length)

    var dataQueue: mutable.Queue[RDD[Trip]] = mutable.Queue.empty
    groupedData.foreach(trips => dataQueue += sparkCtx.makeRDD(trips.toArray))
    printf("\n\nTest Queue size is " + dataQueue.size)

    groupedData.zipWithIndex.foreach { case (trips: Iterable[Trip], index: Int) =>
      println("Items List " + index)
      val passengers: Array[Int] = trips.map(_.passengers).toArray
      val cnt = passengers.length
      println("Sum is " + passengers.sum)
      println("Cnt is " + cnt)
      val passengersRdd = sparkCtx.parallelize(passengers)
      println("Mean " + passengersRdd.mean())
      println("Stdv" + passengersRdd.stdev())
    }
    sparkStreamCtx.queueStream(dataQueue, true)
  }

  def cellCreator(trip: Trip) = cellBuilder.cellForCarStop(trip.pickup)

  def main(args: Array[String]) {
    if (args.length < 1) {
      streamingRate = 1
      windowSize = 3      // 2 hours 60 * 60 * 1000L
      slidingInterval = 2 // 0.5 hour 60 * 60 * 1000L
    } else {
      streamingRate = args(0).toInt
      windowSize = args(1).toInt
      slidingInterval = args(2).toInt
    }

    val sparkConf = new SparkConf().setAppName(appName).setMaster("local[*]")
    val sparkStreamCtx = new StreamingContext(sparkConf, Milliseconds(streamingRate))
    sparkStreamCtx.sparkContext.setLogLevel("ERROR")
    sparkStreamCtx.checkpoint("/tmp")

    val data: InputDStream[Trip] = prepareTestData(sparkStreamCtx)
    val dataWindow = data.window(new Duration(windowSize), new Duration(slidingInterval))

    // my main problem lies in the following line
    val newDataWindow = dataWindow.transform(rdd => sparkStreamCtx.sparkContext.parallelize(rdd.take(10)))
    newDataWindow.print

    sparkStreamCtx.start()
    sparkStreamCtx.awaitTerminationOrTimeout(1000)
  }
}
I don't mind other ways of sorting a DStream and taking its top N instead of my approach.
You can use the transform method on the DStream, sort the underlying RDD and take its top n elements into a list, then filter the original RDD to keep only the elements contained in that list.
val n = 10
val topN = result.transform(rdd => {
  val list = rdd.sortBy(_._1).take(n)
  rdd.filter(list.contains)
})
topN.print
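Applied to the dataWindow from the question, the same pattern would look roughly like the sketch below; the sort key _.passengers is only an illustration (use whatever field and ordering define your top N), and it assumes Trip compares sensibly by equality (e.g. it is a case class):
val n = 10
val topTrips = dataWindow.transform { rdd =>
  // take(n) only brings the n sorted elements to the driver;
  // the filter keeps the result as a distributed RDD inside the DStream
  val top = rdd.sortBy(_.passengers, ascending = false).take(n)
  rdd.filter(top.contains)
}
topTrips.print()
Note that, unlike the attempt in the question, nothing here refers to sparkStreamCtx inside the transform closure, which is what was triggering the NotSerializableException.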

Spark job returns a different result on each run

I am working on Scala code that performs linear regression on certain datasets. Right now I am using 20 cores and 25 executors, and every time I run the Spark job I get a different result.
The input sizes of the files are 2 GB and 400 MB. However, when I run the job with 20 cores and 1 executor, I get consistent results.
Has anyone experienced such a thing so far?
Please find the code below:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SchemaRDD
import org.apache.spark.Partitioner
import org.apache.spark.storage.StorageLevel
object TextProcess {
  def main(args: Array[String]) {
    val conf = new SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val numExecutors = conf.get("spark.executor.instances").toInt

    // Read the 2 input files
    // First file is either cases / controls
    val input1 = sc.textFile(args(0))
    // Second file is Gene Expression
    val input2 = sc.textFile(args(1))

    // collecting header information
    val header1 = sc.parallelize(input1.take(1))
    val header2 = sc.parallelize(input2.take(1))

    // mapping data without the header information
    val map1 = input1.subtract(header1).map(x => (x.split(" ")(0) + x.split(" ")(1), x))
    val map2 = input2.subtract(header2).map(x => (x.split(" ")(0) + x.split(" ")(1), x))

    // joining data. here is where the order was getting affected.
    val joinedMap = map1.join(map2)

    // adding the header back to the top of RDD
    val x = header1.union(joinedMap.map { case (x, (y, z)) => y })
    val y = header2.union(joinedMap.map { case (x, (y, z)) => z })

    // removing irrelevant columns
    val rddX = x.map(x => x.split(" ").drop(3)).zipWithIndex.map { case (a, b) => a.map(x => b.toString + " " + x.toString) }
    val rddY = y.map(x => x.split(" ").drop(2)).zipWithIndex.map { case (a, b) => a.map(x => b.toString + " " + x.toString) }

    // transposing and cross joining data. This keeps the identifier at the start
    val transposedX = rddX.flatMap(x => x.zipWithIndex.map(x => x.swap)).reduceByKey((a, b) => a + ":" + b).map { case (a, b) => b.split(":").sorted }
    val transposedY = rddY.flatMap(x => x.zipWithIndex.map(x => x.swap)).reduceByKey((a, b) => a + ":" + b).map { case (a, b) => b.split(":").sorted }.persist(StorageLevel.apply(false, true, false, false, numExecutors))

    val cleanedX = transposedX.map(x => x.map(x => x.slice(x.indexOfSlice(" ") + 1, x.length)))
    val cleanedY = transposedY.map(x => x.map(x => x.slice(x.indexOfSlice(" ") + 1, x.length))).persist(StorageLevel.apply(false, true, false, false, numExecutors))

    val cartXY = cleanedX.cartesian(cleanedY)
    val finalDataSet = cartXY.map { case (a, b) => a zip b }

    // convert to key value pair
    val regressiondataset = finalDataSet.map(x => (x(0), x.drop(1).filter { case (a, b) => a != "NA" && b != "NA" && a != "null" && b != "null" }.map { case (a, b) => (a.toDouble, b.toDouble) }))

    val linearOutput = regressiondataset.map(s => new LinearRegression(s._1, s._2).outputVal)
    linearOutput.saveAsTextFile(args(2))

    cleanedY.unpersist()
    transposedY.unpersist()
  }
}
class LinearRegression(val keys: (String, String), val pairs: Array[(Double, Double)]) {
  val size = pairs.size

  // first pass: read in data, compute xbar and ybar
  val sums = pairs.aggregate(new X_X2_Y(0D, 0D, 0D))(_ + new X_X2_Y(_), _ + _)
  val bars = (sums.x / size, sums.y / size)

  // second pass: compute summary statistics
  val sumstats = pairs.foldLeft(new X2_Y2_XY(0D, 0D, 0D))(_ + new X2_Y2_XY(_, bars))
  val beta1 = sumstats.xy / sumstats.x2
  val beta0 = bars._2 - (beta1 * bars._1)
  val betas = (beta0, beta1)
  //println("y = " + ("%4.3f" format beta1) + " * x + " + ("%4.3f" format beta0))

  // analyze results
  val correlation = pairs.aggregate(new RSS_SSR(0D, 0D))(_ + RSS_SSR.build(_, bars, betas), _ + _)
  val R2 = correlation.ssr / sumstats.y2
  val svar = correlation.rss / (size - 2)
  val svar1 = svar / sumstats.x2
  val svar0 = (svar / size) + (bars._1 * bars._1 * svar1)
  val svar0bis = svar * sums.x2 / (size * sumstats.x2)

  /* println("R^2 = " + R2)
     println("std error of beta_1 = " + Math.sqrt(svar1))
     println("std error of beta_0 = " + Math.sqrt(svar0))
     println("std error of beta_0 = " + Math.sqrt(svar0bis))
     println("SSTO = " + sumstats.y2)
     println("SSE = " + correlation.rss)
     println("SSR = " + correlation.ssr) */

  def outputVal() =
    keys._1 + "\t" + keys._2 +
      "\t" + beta1 +
      "\t" + beta0 +
      "\t" + R2 +
      "\t" + Math.sqrt(svar1) +
      "\t" + Math.sqrt(svar0) +
      "\t" + sumstats.y2 +
      "\t" + correlation.rss +
      "\t" + correlation.ssr + "\t"
}
object RSS_SSR {
  def build(p: (Double, Double), bars: (Double, Double), betas: (Double, Double)): RSS_SSR = {
    val fit = (betas._2 * p._1) + betas._1
    val rss = (fit - p._2) * (fit - p._2)
    val ssr = (fit - bars._2) * (fit - bars._2)
    new RSS_SSR(rss, ssr)
  }
}

class RSS_SSR(val rss: Double, val ssr: Double) {
  def +(p: RSS_SSR): RSS_SSR = new RSS_SSR(rss + p.rss, ssr + p.ssr)
}

class X_X2_Y(val x: Double, val x2: Double, val y: Double) {
  def this(p: (Double, Double)) = this(p._1, p._1 * p._1, p._2)
  def +(p: X_X2_Y): X_X2_Y = new X_X2_Y(x + p.x, x2 + p.x2, y + p.y)
}

class X2_Y2_XY(val x2: Double, val y2: Double, val xy: Double) {
  def this(p: (Double, Double), bars: (Double, Double)) =
    this((p._1 - bars._1) * (p._1 - bars._1), (p._2 - bars._2) * (p._2 - bars._2), (p._1 - bars._1) * (p._2 - bars._2))
  def +(p: X2_Y2_XY): X2_Y2_XY = new X2_Y2_XY(x2 + p.x2, y2 + p.y2, xy + p.xy)
}

Using Spark to run KMeans clustering, program blocks?

When I use the Apache Spark Scala API to run KMeans clustering, my program is as follows:
// imports assumed for Spark 0.9 (Vector here is org.apache.spark.util.Vector)
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.util.Vector

object KMeans {

  def closestPoint(p: Vector, centers: Array[Vector]) = {
    var index = 0
    var bestIndex = 0
    var closest = Double.PositiveInfinity
    for (i <- 0 until centers.length) {
      var tempDist = p.squaredDist(centers(i))
      if (tempDist < closest) {
        closest = tempDist
        bestIndex = i
      }
    }
    bestIndex
  }

  def parseVector(line: String): Vector = {
    new Vector(line.split("\\s+").map(s => s.toDouble))
  }

  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "F:/OpenSoft/hadoop-2.2.0")
    val sc = new SparkContext("local", "kmeans cluster",
      "G:/spark-0.9.0-incubating-bin-hadoop2",
      SparkContext.jarOfClass(this.getClass()))

    val lines = sc.textFile("G:/testData/synthetic_control.data.txt") // RDD[String]
    val count = lines.count
    val data = lines.map(parseVector _) // RDD[Vector]
    data.foreach(println)

    val K = 6
    val convergeDist = 0.1
    val kPoint = data.takeSample(withReplacement = false, K, 42) // Array[Vector]
    kPoint.foreach(println)

    var tempDist = 1.0
    while (tempDist > convergeDist) {
      val closest = data.map(p => (closestPoint(p, kPoint), (p, 1)))
      val pointStat = closest.reduceByKey { case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2) }
      val newKPoint = pointStat.map { pair => (pair._1, pair._2._1 / pair._2._2) }.collectAsMap()

      tempDist = 0.0
      for (i <- 0 until K) {
        tempDist += kPoint(i).squaredDist(newKPoint(i))
      }
      for (newP <- newKPoint) {
        kPoint(newP._1) = newP._2
      }
      println("Finish iteration (delta=" + tempDist + ")")
    }

    println("Finish centers: ")
    kPoint.foreach(println)
    System.exit(0)
  }
}
When I run it in local mode, the log info is as follows:
..................
14/03/31 11:29:15 INFO HadoopRDD: Input split: hdfs://hadoop-01:9000/data/synthetic_control.data:0+288374
The program then blocks and does not continue running.
Can anyone help me?