Scala & Spark - ArrayBuffer does not append

I am new to Scala and Apache Spark and have been trying out some online examples.
I am using scala.collection.mutable.ArrayBuffer to store a list of tuples of the form (Int, Array[String]). I am creating an ArrayBuffer and then parsing a text file line by line, appending the required data from each line to the ArrayBuffer.
The code has no compilation errors, but when I access the ArrayBuffer outside the block where I append to it, I cannot get its contents and the ArrayBuffer is always empty.
My code is below:
val conf = new SparkConf().setAppName("second")
val spark = new SparkContext(conf)
val file = spark.textFile("\\Desktop\\demo.txt")
var list = scala.collection.mutable.ArrayBuffer[(Int, Array[String])]()
var count = 0

file.map(_.split(","))
  .foreach { a =>
    count = countByValue(a)      // returns an Int
    println("count is " + count) // showing correct output "count is 3"
    var t = (count, a)
    println("t is " + t)         // showing correct output "t is (3,[Ljava.lang.String;#539f0af)"
    list += t
  }

println("list count is = " + list.length) // output "list count is = 0"
list.foreach(println)                     // no output
Can someone point out why this code isn't working?
Any help is greatly appreciated.

I assume spark is a SparkContext. In that case it is not surprising that the local list is not updated: only a copy of it is sent to the executors as part of the closure. If you need a mutable value within the foreach, you should use an Accumulator.
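For reference, here is a minimal sketch of the usual alternative: instead of mutating a driver-side buffer from inside foreach, build the tuples in a transformation and collect them back to the driver (countByValue below is the asker's own helper, assumed to return an Int):
val pairs: Array[(Int, Array[String])] =
  file.map(_.split(","))
    .map(a => (countByValue(a), a)) // countByValue is the asker's helper
    .collect()
println("list count is = " + pairs.length) // now reports the real count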

Why can't my simple Spark code print anything?

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

def main(args: Array[String]): Unit = {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
  val sc = new SparkContext(sparkConf)
  val rdd1: RDD[Int] = sc.makeRDD(List(2, 4, 6, 8), 2)
  // just print each partition's index and data, then return the partition's data unchanged
  val rdd2: RDD[Int] = rdd1.mapPartitionsWithIndex((par, datas) => {
    println("data and partition info : par = " + par + " datas = " + datas.mkString(" "))
    datas // return datas again
  })
  // I think rdd2 should contain the four elements 2, 4, 6, 8,
  // so I foreach over rdd2, but nothing is printed. Why does this happen?
  rdd2.collect().foreach(println)
  sc.stop()
}
I am studying Spark and wrote this simple demo, but there is something I do not understand.
I cannot figure out why rdd2.collect().foreach(println) does not print anything.
Your problem is that the function you pass to mapPartitionsWithIndex returns an Iterator that has already been traversed by the mkString call. Iterators are special collections that help deal with large partitions by reading elements one by one, and they are used in several functions of the RDD API such as foreach, mapPartitions and zipPartitions. Take a look at how they work, and pay attention to this statement: "one should never use an iterator after calling a method on it." Drop the println line and it should work.
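If you do want to keep the debug output, a minimal sketch of one way to do it (materializing the iterator into a List so it can effectively be traversed twice) might look like this:
val rdd2: RDD[Int] = rdd1.mapPartitionsWithIndex((par, datas) => {
  val buffered = datas.toList // traverse the iterator once into a List
  println("data and partition info : par = " + par + " datas = " + buffered.mkString(" "))
  buffered.iterator // return a fresh iterator over the same elements
})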

Scala: Task not serializable when using closure

So I am fairly new to Spark and Scala, and from my understanding you should be able to pass a closure into a map function and have it modify the values. However, I am getting the Task not serializable error when attempting this.
My code is as follows:
// Spark Context
val sparkContext = spark.sparkContext
val random = scala.util.Random

// RDD Initialization
val array = Seq.fill(500)(random.nextInt(51))
val RDD = sparkContext.parallelize(array)

// Spark Operations for Count, Sum, and Mean
var count = RDD.count()
var sum = RDD.reduce(_ + _)
val mean = sum / count

// Output Count, Sum, and Mean
println("Count: " + count)
println("Sum: " + sum)
println("Mean: " + mean)

val difference = (x: Int) => { x - mean }
var differences = RDD.map(difference)
Any help would be greatly appreciated.
Instead of defining a method with def, try using a function value (val):
val difference = (x: Int) => { x - mean }
When you use def to define a function, Spark tries to serialize the object that contains it. This usually results in a Task not serializable error, because there may be something (a val or var) in that object which is not serializable.
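As a minimal sketch of the point being made (assuming mean is a local value computed on the driver and the surrounding object is not itself serializable):
// A method belongs to the enclosing object, so the closure shipped to the
// executors drags the whole object along and can trigger Task not serializable.
def differenceDef(x: Int) = x - mean

// A function value only captures what it actually uses (here mean), which is serializable.
val differenceVal = (x: Int) => x - mean

val differences = RDD.map(differenceVal)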

Write data to a single file in Spark Scala

I am trying to write data to a single file using Spark with Scala:
while (loop > 0) {
  val getReq = new HttpGet("ww.url.com")
  val httpResponse = client.execute(getReq)
  val data = Source.fromInputStream(httpResponse.getEntity.getContent()).getLines.mkString
  val parser = JSON.parseFull(data)
  val globalMap = parser.get.asInstanceOf[Map[String, Any]]
  val reviewMap = globalMap.get("payload").get.asInstanceOf[Map[String, Any]]
  val df = context.sparkContext.parallelize(Seq(reviewMap.get("records").get.toString())).toDF()
  if (startIndex == 0) {
    df.coalesce(1).write.mode(SaveMode.Overwrite).json("C:\\Users\\kh\\Desktop\\spark\\raw\\data\\final")
  } else {
    df.coalesce(1).write.mode(SaveMode.Append).json("C:\\Users\\kh\\Desktop\\spark\\raw\\data\\final")
  }
  startIndex = startIndex + limit
  loop = loop - 1
  httpResponse.close()
}
The number of files created equals the number of loop iterations, and I want to create only one file.
It is also creating CRC files, which I want to remove.
I tried the config below, but it only stops the creation of the _SUCCESS files:
.config("dfs.client.read.shortcircuit.skip.checksum", "true")
.config("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
.config("fs.file.impl.disable.cache", true)
Any ideas on how to create only one file, without the CRC and _SUCCESS files?
Re: "The number of file created is the number of loops"
Even though you are using df.coalesce(1) in your code, it is still being executed as many number of times as you run the while loop.
I want to create one file only
From your code it seems that you are trying to invoke HTTP GET requests to some URL and save the content after parsing.
If this understanding is right then I believe you should not be using a while loop to do this task. There is map transformation that you could use in the following manner.
Please find below the Psuedo-Code for your reference.
val urls = List("a.com","b.com","c.com")
val sourcedf = sparkContext.parallelize(urls).toDF
//this could be map or flatMap based on your requirement.
val yourprocessedDF = sourcedf.map(<< do your parsing here and emit data>>)
yourprocessedDF.repartition(1).write(<<whichever format you need>>)
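As a rough, more concrete sketch of that idea (the URL list, the output path, and the fetchAndParse helper below are placeholders standing in for the asker's own HttpGet and JSON.parseFull logic):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("single-file").getOrCreate()
import spark.implicits._

// Hypothetical helper standing in for the HTTP call and JSON parsing above.
def fetchAndParse(url: String): String = ???

val urls = List("a.com", "b.com", "c.com")

val processed = spark.sparkContext
  .parallelize(urls)
  .map(fetchAndParse) // the fetching and parsing now run on the executors
  .toDF("records")

// A single partition yields a single part file (plus Hadoop metadata files).
processed.repartition(1).write.mode("overwrite").json("output/final")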

How to save a value to a file using Scala

I am trying to save a value to a file, but I keep getting an error.
I have tried
.saveAsTextFile("/home/amel/timer")
REDUCER Function
val startReduce = System.currentTimeMillis()
val y = sc.textFile("/home/amel/10MB").filter(!_.contains("NULL")).filter(!_.contains("Null"))
val er = y.map(row => {
  val cols = row.split(",")
  (cols(1).split("-")(0) + "," + cols(2) + "," + cols(3), 1)
}).reduceByKey(_ + _).map(x => x._1 + "," + x._2)
er.collect.foreach(println)
val endReduce = System.currentTimeMillis()
val durationReduce = ((endReduce-startReduce)/1000).saveAsTextFile("home/amel/timer/")
The error I'm receiving is on this line:
val durationReduce = ((endReduce-startReduce)/1000).saveAsTextFile("home/amel/timer/")
It says: saveAsTextFile is not a member of Long
The output I want is a number.
Long does not have a method named saveAsTextFile. If you want to write a Long value there are many ways; a simple one is to use Java's PrintWriter:
import java.io.PrintWriter

val duration = (endReduce - startReduce) / 1000
new PrintWriter("home/amel/timer/time") { write(duration.toString); close() }
If you still want to use Spark's RDD saveAsTextFile, then you can use
sc.parallelize(Seq(duration)).saveAsTextFile("path")
but this does not make much sense just to write a single value.
saveAsTextFile is a method on the class org.apache.spark.rdd.RDD (docs).
The expression ((endReduce-startReduce)/1000) is of type Long, so it does not have this method, hence the error you are seeing: "saveAsTextFile is not a member of Long".
This answer is applicable here: https://stackoverflow.com/a/32105659/8261
Basically the situation is that you have a single numeric value and you want to write it to a file. Your first thought is to create a distributed collection across a cluster of machines that contains only this value, and to let those machines write it to a set of files in a distributed way.
I'd argue this is not the right approach. Do not use Spark to save a single value to a file. Instead, you can use a PrintWriter:
val out = new java.io.PrintWriter("filename.txt")
out.println(finalvalue)
out.close()

Scala map doesn't store data

I am trying to save CSV data to a hash map. The CSV file seems to be read and saved correctly in the RDD, but NOT in the map.
I tried a HashMap and a Map with both the put and += methods, but nothing works. Any idea what is going on?
val logFile3 = "d:/data/data.csv"
val rawdf3 = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load(logFile3)

var activityName = scala.collection.mutable.Map[String, String]()

// save key-value pairs to an RDD to check
val activityNameRDD = rawdf3.map { row =>
  activityName += (row.getAs("key").toString -> row.getAs("value").toString) // I think this should work, but it does not
  println(row.getAs("key").toString + " - " + row.getAs("value").toString)   // prints all the data correctly
  (row.getAs("key").toString, row.getAs("value").toString)
}

activityNameRDD.saveAsTextFile("d:/outdata/activityName") // all csv data saved correctly
activityName.foreach(row => println(row._1 + " = " + row._2)) // prints nothing
println(activityName.getOrElse("KEY1", "NON")) // prints "NON"
println(activityName.getOrElse("KEY2", "NON")) // prints "NON"
Are you using Spark? The "RDD" suffix on your variable name implies that.
If yes, then read the "Shared Variables" section of Spark's documentation thoroughly:
Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient.
When you try to modify a shared variable from map, each worker modifies its own copy and the updates are lost in the end. If you really need shared mutable state, consider using an Accumulator instead.
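As an illustrative sketch only (assuming sc is the underlying SparkContext and the Spark 2.x longAccumulator API is available, and counting rows rather than building a map, since accumulators are meant for simple commutative aggregation):
val processed = sc.longAccumulator("processed rows")
rawdf3.foreach { row =>
  processed.add(1) // each worker's updates are merged back on the driver
}
println("rows seen on the driver: " + processed.value)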
Rather than using var, mutable.Map and mutating things as side-effects (three big no-nos in Scala), why not just do things directly? It's clearer what's going on and should also fix your issue:
val activityName: Map[String, String] = rawdf3.map { row =>
  row.getAs("key").toString -> row.getAs("value").toString
}.collect().toMap