Task not serializable in Scala
In my application, I'm using the parallelize method to save an Array to a file.
The code is as follows:
val sourceRDD = sc.textFile(inputPath + "/source")
val destinationRDD = sc.textFile(inputPath + "/destination")
val source_primary_key = sourceRDD.map(rec => (rec.split(",")(0).toInt, rec))
val destination_primary_key = destinationRDD.map(rec => (rec.split(",")(0).toInt, rec))
val extra_in_source = source_primary_key.subtractByKey(destination_primary_key)
val extra_in_destination = destination_primary_key.subtractByKey(source_primary_key)
val source_subtract = source_primary_key.subtract(destination_primary_key)
val Destination_subtract = destination_primary_key.subtract(source_primary_key)
val exact_bestmatch_src = source_subtract.subtractByKey(extra_in_source).sortByKey(true).map(rec => (rec._2))
val exact_bestmatch_Dest = Destination_subtract.subtractByKey(extra_in_destination).sortByKey(true).map(rec => (rec._2))
val exact_bestmatch_src_p = exact_bestmatch_src.map(rec => (rec.split(",")(0).toInt))
val primary_key_distinct = exact_bestmatch_src_p.distinct.toArray()
for (i <- primary_key_distinct) {
  var dummyVar: String = ""
  val src = exact_bestmatch_src.filter(line => line.split(",")(0).toInt.equals(i))
  var dest = exact_bestmatch_Dest.filter(line => line.split(",")(0).toInt.equals(i)).toArray
  for (print1 <- src) {
    var sourceArr: Array[String] = print1.split(",")
    var exactbestMatchCounter: Int = 0
    var index: Array[Int] = new Array[Int](1)
    println(print1 + "source")
    for (print2 <- dest) {
      var bestMatchCounter = 0
      var i: Int = 0
      println(print1 + "source + destination" + print2)
      for (i <- 0 until sourceArr.length) {
        if (print1.split(",")(i).equals(print2.split(",")(i))) {
          bestMatchCounter += 1
        }
      }
      if (exactbestMatchCounter < bestMatchCounter) {
        exactbestMatchCounter = bestMatchCounter
        dummyVar = print2
        index +:= exactbestMatchCounter //9,8,9
      }
    }
    var z = index.zipWithIndex.maxBy(_._1)._2
    if (exactbestMatchCounter >= 0) {
      var samparr: Array[String] = new Array[String](4)
      samparr +:= print1 + " BEST_MATCH " + dummyVar
      var deletedest: Array[String] = new Array[String](1)
      deletedest = dest.take(z) ++ dest.drop(1)
      dest = deletedest
      val myFile = sc.parallelize((samparr)).saveAsTextFile(outputPath)
I have used the parallelize method, and I even tried the method below to save it as a file:
val myFile = sc.textFile(samparr.toString())
val finalRdd = myFile
finalRdd.coalesce(1).saveAsTextFile(outputPath)
but it keeps throwing the error:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
You can't treat an RDD like a local collection. All operations against it happen over a distributed cluster. To work, all functions you run against that RDD must be serializable.
The line
for (print1 <- src) {
Here you are iterating over the RDD src; everything inside the loop must be serializable, as it will be run on the executors.
Inside that loop, however, you try to call sc.parallelize(...). SparkContext is not serializable. Working with RDDs and the SparkContext is something you do on the driver, and cannot do within an RDD operation.
I'm not entirely sure what you are trying to accomplish, but it looks like some sort of hand-coded join between the source and destination. You can't work with loops over RDDs the way you can with local collections. Make use of APIs such as map, join, groupBy, and the like to create your final RDD, then save that (see the sketch below).
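For illustration only, here is a rough sketch of that idea under my own assumptions about the data; it reuses source_primary_key, destination_primary_key, and outputPath from the question but is not the asker's exact matching logic:

val joined = source_primary_key.join(destination_primary_key) // (key, (sourceLine, destinationLine))
val bestMatches = joined
  .map { case (key, (src, dst)) =>
    // count how many comma-separated fields agree between the two lines
    val score = src.split(",").zip(dst.split(",")).count { case (a, b) => a == b }
    (key, (src + " BEST_MATCH " + dst, score))
  }
  .reduceByKey { (a, b) => if (a._2 >= b._2) a else b } // keep the highest-scoring pair per key
  .values
  .map(_._1)
bestMatches.saveAsTextFile(outputPath)

Because everything here is expressed as transformations, nothing inside a loop ever needs to capture the SparkContext.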
If you absolutely feel you must use a foreach loop over the RDD like this, then you can't use sc.parallelize().saveAsTextFile(). Instead, open an output stream using the Hadoop file API and write your array to the file manually, roughly as sketched below.
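As a hedged sketch of that second option (the per-key file name and the configuration handling are my assumptions, not code from the original answer), the inner write could look roughly like this instead of the sc.parallelize call:

import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Inside the loop body, avoid SparkContext entirely and write through the Hadoop
// FileSystem API instead.
val fs = FileSystem.get(new Configuration())
val out = new PrintWriter(fs.create(new Path(outputPath + "/best_match_" + i + ".txt"))) // hypothetical per-key file name
try {
  samparr.foreach(out.println) // one array element per line
} finally {
  out.close()
}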
Finally, this piece of code helped me save an array to a file:
new PrintWriter(outputPath) { write(array.mkString(" ")); close }
Related
Looping through Map Spark Scala
Within this code we have two files: athletes.csv, which contains names, and twitter.test, which contains the tweet messages. We want to find, for every line in twitter.test, the name that matches a name in athletes.csv. We applied a map function to store the names from athletes.csv and want to check every name against every line of the test file.

object twitterAthlete {

  def loadAthleteNames(): Map[String, String] = {
    // Handle character encoding issues:
    implicit val codec = Codec("UTF-8")
    codec.onMalformedInput(CodingErrorAction.REPLACE)
    codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

    // Create a Map of Strings to Strings, and populate it from athletes.csv.
    var athleteInfo: Map[String, String] = Map()
    //var movieNames:Map[Int, String] = Map()
    val lines = Source.fromFile("../athletes.csv").getLines()
    for (line <- lines) {
      var fields = line.split(',')
      if (fields.length > 1) {
        athleteInfo += (fields(1) -> fields(7))
      }
    }
    return athleteInfo
  }

  def parseLine(line: String): (String) = {
    var athleteInfo = loadAthleteNames()
    var hello = new String
    for ((k, v) <- athleteInfo) {
      if (line.toString().contains(k)) {
        hello = k
      }
    }
    return (hello)
  }

  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val sc = new SparkContext("local[*]", "twitterAthlete")
    val lines = sc.textFile("../twitter.test")
    var athleteInfo = loadAthleteNames()
    val splitting = lines.map(x => x.split(";")).map(x => if (x.length == 4 && x(2).length <= 140) x(2))
    var hello = new String()
    val container = splitting.map(x => for ((key, value) <- athleteInfo) if (x.toString().contains(key)) { key }).cache
    container.collect().foreach(println)
    // val mapping = container.map(x => (x,1)).reduceByKey(_+_)
    // mapping.collect().foreach(println)
  }
}

The first file looks like:

id,name,nationality,sex,height,...
001,Michael,USA,male,1.96,...
002,Json,GBR,male,1.76,...
003,Martin,female,1.73,...

The second file looks like:

time, id, tweet
12:00, 03043, some message that contains some athletes' names, ...
02:00, 03023, some message that contains some athletes' names, ...

But I got an empty result after running this code; any suggestions are much appreciated. The result I got is empty:

()
()
()

but the result I expected is something like:

(name,1)
(other name,1)
You need to use yield to return a value from the for comprehension inside your map:

val container = splitting.map(x => for ((key, value) <- athleteInfo; if (x.toString().contains(key))) yield (key, 1)).cache
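As a possible follow-up (my addition, not part of the original answer): since the for/yield produces a collection of matches per line, you could flatten it and count matches per name, mirroring the commented-out reduceByKey in the question:

val container = splitting.flatMap { x =>
  // yield one (name, 1) pair for every athlete name found in the line
  for ((key, _) <- athleteInfo if x.toString().contains(key)) yield (key, 1)
}
val counts = container.reduceByKey(_ + _)
counts.collect().foreach(println) // prints (name, count) pairs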
I think you should just start with the simplest option first... I would use DataFrames so you can use the built-in CSV parsing and leverage Catalyst, Tungsten, etc. Then you can use the built-in Tokenizer to split the tweets into words, explode, and do a simple join. Depending on how big or small the athlete-name data is, you'll end up with a more optimized broadcast join and avoid a shuffle.

import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.Tokenizer

val tweets = spark.read.format("csv").load(...)
val athletes = spark.read.format("csv").load(...)

val tokenizer = new Tokenizer()
tokenizer.setInputCol("tweet")
tokenizer.setOutputCol("words")

val tokenized = tokenizer.transform(tweets)
val exploded = tokenized.withColumn("word", explode('words))
val withAthlete = exploded.join(athletes, 'word === 'name)
withAthlete.select(exploded("id"), 'name).show()
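If the athlete table is small enough, you could also make the broadcast explicit rather than relying on the optimizer (a small addition of mine, not part of the original answer):

import org.apache.spark.sql.functions.broadcast

// Hint Spark to broadcast the smaller athletes DataFrame and avoid a shuffle join.
val withAthlete = exploded.join(broadcast(athletes), 'word === 'name)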
for loop into map method with Spark using Scala
Hi, I want to use a "for" inside a map method in Scala. How can I do it? For example, for each line read I want to generate a random word:

val rdd = file.map(line => (line, {
  val chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
  val word = new String
  val res = new String
  val rnd = new Random
  val len = 4 + rnd.nextInt((6 - 4) + 1)
  for (i <- 1 to len) {
    val char = chars(rnd.nextInt(51))
    word.concat(char.toString)
  }
  word
}))

My current output is:

Array[(String, String)] = Array((1,""), (2,""), (3,""), (4,""), (5,""), (6,""), (7,""), (8,""), (9,""), (10,""), (11,""), (12,""), ..., (86...

I don't know why the right side is empty.
There's no need for var here. It's a one-liner:

Seq.fill(len)(chars(rnd.nextInt(51))).mkString

This will create a sequence of Char of length len by repeatedly calling chars(rnd.nextInt(51)), then make it into a String. Thus you'll get something like this:

import org.apache.spark.rdd.RDD
import scala.util.Random

val chars = ('a' to 'z') ++ ('A' to 'Z')

val rdd = file.map(line => {
  val randomWord = {
    val rnd = new Random
    val len = 4 + rnd.nextInt((6 - 4) + 1)
    Seq.fill(len)(chars(rnd.nextInt(chars.length - 1))).mkString
  }
  (line, randomWord)
})
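For a quick sanity check (a hypothetical spark-shell snippet of mine, assuming rdd was built as above), you could peek at a few results:

// Print a handful of (line, randomWord) pairs to confirm the words are non-empty.
rdd.take(3).foreach { case (line, word) => println(s"$line -> $word") }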
word.concat doesn't modify word but returns a new String. You can make word a var and append each new character to it:

var word = new String
...
for { ... word += char ... }
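Filling in that fragment as a complete sketch (the surrounding code is my assumption; chars, rnd, and len are reused from the question):

var word = ""
for (i <- 1 to len) {
  // pick a random character and append it; += works because word is now a var
  val char = chars(rnd.nextInt(chars.length))
  word += char
}
word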
Reorganize data in RDD and store in defined format [duplicate]
spark job freeze when started in ParArray
I want to convert a set of time-series data from multiple CSV files to LabeledPoint and save it to a Parquet file. The CSV files are small, usually < 10 MiB. When I start it with a ParArray, it submits 4 jobs at a time and freezes. The code is here:

val idx = Another_DataFrame

ListFiles(new File("data/stock data"))
  .filter(_.getName.contains(".csv")).zipWithIndex
  .par //comment this line and code runs smoothly
  .foreach { f =>
    val stk = spark_csv(f._1.getPath) //doing good
    ColMerge(stk, idx, RESULT_PATH(f)) //freeze here
    stk.unpersist()
  }

and the freezing part:

def ColMerge(ori: DataFrame, index: DataFrame, PATH: String) = {
  val df = ori.join(index, ori("date") === index("index_date")).drop("index_date").orderBy("date").cache
  val head = df.head
  val col = df.columns.filter(e => e != "code" && e != "date" && e != "name")
  val toMap = col.filter {
    e => head.get(head.fieldIndex(e)).isInstanceOf[String]
  }.sorted
  val toCast = col.diff(toMap).filterNot(_ == "data")
  val res: Array[((String, String, Array[Double]), Long)] = df.sort("date").map { row =>
    val res1 = toCast.map { col =>
      row.getDouble(row.fieldIndex(col))
    }
    val res2 = toMap.flatMap { col =>
      val mapping = new Array[Double](GlobalConfig.ColumnMapping(col).size)
      row.getString(row.fieldIndex(col)).split(";").par.foreach { word =>
        mapping(GlobalConfig.ColumnMapping(col)(word)) = 1
      }
      mapping
    }
    (
      row.getString(row.fieldIndex("code")),
      row.getString(row.fieldIndex("date")),
      res1 ++ res2 ++ row.getAs[Seq[Double]]("data")
    )
  }.zipWithIndex.collect
  df.unpersist
  val dataset = GlobalConfig.sctx.makeRDD(res.map { day =>
    (
      day._1._1,
      day._1._2,
      try {
        new LabeledPoint(GetHighPrice(res(day._2.toInt + 2)._1._3.slice(0, 4)) / GetLowPrice(res(day._2.toInt)._1._3.slice(0, 4)) * 1.03, Vectors.dense(day._1._3))
      } catch {
        case ex: ArrayIndexOutOfBoundsException => new LabeledPoint(-1, Vectors.dense(day._1._3))
      }
    )
  }).filter(_._3.label != -1).toDF("code", "date", "labeledpoint")
  dataset.write.mode(SaveMode.Overwrite).parquet(PATH)
}

The exact job that freezes is the DataFrame.sort() or zipWithIndex when generating res in ColMerge. Since most of the work gets done after collect, I really want to use ParArray to accelerate ColMerge, but this weird freeze stopped me from doing so. Do I need to create a new thread pool to do this?
Function returns an empty List in Spark
Below is code for getting the list of file names in a zipped file:

def getListOfFilesInRepo(zipFileRDD: RDD[(String, PortableDataStream)]): (List[String]) = {
  val zipInputStream = zipFileRDD.values.map(x => new ZipInputStream(x.open))
  val filesInZip = new ArrayBuffer[String]()
  var ze: Option[ZipEntry] = None
  zipInputStream.foreach(stream => {
    do {
      ze = Option(stream.getNextEntry)
      ze.foreach { ze =>
        if (ze.getName.endsWith("java") && !ze.isDirectory()) {
          var fileName: String = ze.getName.substring(ze.getName.lastIndexOf("/") + 1, ze.getName.indexOf(".java"))
          filesInZip += fileName
        }
      }
      stream.closeEntry()
    } while (ze.isDefined)
    println(filesInZip.toList.length) // prints 889 (correct)
  })
  println(filesInZip.toList.length) // prints 0 (WHY..?)
  (filesInZip.toList)
}

I execute the above code in the following manner:

scala> val zipFileRDD = sc.binaryFiles("./handsOn/repo~apache~storm~14135470~false~Java~master~2210.zip")
zipFileRDD: org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)] = ./handsOn/repo~apache~storm~14135470~false~Java~master~2210.zip BinaryFileRDD[17] at binaryFiles at <console>:25

scala> getListOfFilesInRepo(zipRDD)
889
0
res12: List[String] = List()

Why am I not getting 889 and instead getting 0?
It happens because filesInZip is not shared between workers. foreach operates on a local copy of filesInZip, and when it finishes this copy is simply discarded and garbage collected. If you want to keep the results, you should use a transformation (most likely a flatMap) and return the collected aggregated values:

def listFiles(stream: PortableDataStream): TraversableOnce[String] = ???

zipInputStream.flatMap(listFiles)

You can learn more from Understanding closures.
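For completeness, here is one way the ??? placeholder could be filled in (this concrete version is my assumption, not part of the original answer; it reuses zipFileRDD from the question):

import java.util.zip.ZipInputStream
import org.apache.spark.input.PortableDataStream
import scala.collection.mutable.ArrayBuffer

def listFiles(stream: PortableDataStream): TraversableOnce[String] = {
  val zis = new ZipInputStream(stream.open)
  val names = ArrayBuffer[String]()
  var entry = zis.getNextEntry
  while (entry != null) {
    // keep only .java files, stripping the directory prefix and the extension
    if (entry.getName.endsWith(".java") && !entry.isDirectory) {
      names += entry.getName.substring(entry.getName.lastIndexOf("/") + 1, entry.getName.indexOf(".java"))
    }
    zis.closeEntry()
    entry = zis.getNextEntry
  }
  zis.close()
  names
}

val filesInZip = zipFileRDD.values.flatMap(listFiles).collect().toList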