Scala map doesn't store data - scala

I am trying to save CSV data into a hash map. The CSV file is read correctly and the data is saved in the RDD, but NOT in the map.
I tried HashMap and Map with put and +=, but nothing works. Any idea what is going on?
val logFile3 = "d:/data/data.csv"
val rawdf3 = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // use first line of all files as header
  .option("inferSchema", "true") // automatically infer data types
  .load(logFile3)

var activityName = scala.collection.mutable.Map[String, String]()

// save key-value pairs to an RDD to check
val activityNameRDD = rawdf3.map { row =>
  activityName += (row.getAs("key").toString -> row.getAs("value").toString) // I expected this to work, but it doesn't
  println(row.getAs("key").toString + " - " + row.getAs("value").toString)   // prints all data correctly
  (row.getAs("key").toString, row.getAs("value").toString)
}
activityNameRDD.saveAsTextFile("d:/outdata/activityName") // all csv data is saved correctly

activityName.foreach { row => println(row._1 + " = " + row._2) } // prints nothing
println(activityName.getOrElse("KEY1", "NON")) // prints "NON"
println(activityName.getOrElse("KEY2", "NON")) // prints "NON"

Are you using Spark? The "RDD" suffix on your variable names implies that.
If yes, then read the "Shared Variables" section of Spark's documentation thoroughly:
Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient.
When you try to modify a shared variable from map, each worker modifies its own copy and the updates are lost in the end. If you really need shared mutable state, consider using an Accumulator instead.
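For this particular case, though, the simplest way to end up with the lookup map on the driver is usually to build the pairs on the executors and collect them, rather than mutating a driver-side map from inside map. A minimal sketch, assuming the CSV really has "key" and "value" columns:
// build the pairs as an RDD transformation, then bring them back to the driver
val activityName: scala.collection.Map[String, String] = rawdf3.rdd
  .map(row => (row.getAs[Any]("key").toString, row.getAs[Any]("value").toString))
  .collectAsMap()

println(activityName.getOrElse("KEY1", "NON")) // should now print the value from the CSV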

Rather than using a var, a mutable.Map and mutation as a side effect (three big no-nos in Scala), why not just build the map directly? It is clearer what is going on and should also fix your issue:
val activityName: Map[String, String] = rawdf3
  .map(row => row.getAs("key").toString -> row.getAs("value").toString)
  .collect() // bring the pairs to the driver; an RDD has no toMap
  .toMap

Related

Write data in a single file in Spark Scala

I am trying to write data to a single file using Spark Scala:
while (loop > 0) {
  val getReq = new HttpGet(ww.url.com)
  val httpResponse = client.execute(getReq)
  val data = Source.fromInputStream(httpResponse.getEntity.getContent()).getLines.mkString
  val parser = JSON.parseFull(data)
  val globalMap = parser.get.asInstanceOf[Map[String, Any]]
  val reviewMap = globalMap.get("payload").get.asInstanceOf[Map[String, Any]]
  val df = context.sparkContext.parallelize(Seq(reviewMap.get("records").get.toString())).toDF()
  if (startIndex == 0) {
    df.coalesce(1).write.mode(SaveMode.Overwrite).json("C:\\Users\\kh\\Desktop\\spark\\raw\\data\\final")
  } else {
    df.coalesce(1).write.mode(SaveMode.Append).json("C:\\Users\\kh\\Desktop\\spark\\raw\\data\\final")
  }
  startIndex = startIndex + limit
  loop = loop - 1
  httpResponse.close()
}
The number of files created equals the number of loop iterations, but I want to create only one file.
It also creates CRC files, which I want to remove.
I tried the configs below, but they only stop the creation of _SUCCESS files:
.config("dfs.client.read.shortcircuit.skip.checksum", "true")
.config("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
.config("fs.file.impl.disable.cache", true)
Any ideas on how to create only one file, without the CRC and _SUCCESS files?
Re: "The number of file created is the number of loops"
Even though you are using df.coalesce(1) in your code, it is still being executed as many number of times as you run the while loop.
I want to create one file only
From your code it seems that you are trying to invoke HTTP GET requests to some URL and save the content after parsing.
If this understanding is right then I believe you should not be using a while loop to do this task. There is map transformation that you could use in the following manner.
Please find below the Psuedo-Code for your reference.
val urls = List("a.com","b.com","c.com")
val sourcedf = sparkContext.parallelize(urls).toDF
//this could be map or flatMap based on your requirement.
val yourprocessedDF = sourcedf.map(<< do your parsing here and emit data>>)
yourprocessedDF.repartition(1).write(<<whichever format you need>>)
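For a more concrete illustration, here is a minimal sketch of that idea. It assumes a SparkSession named spark; the URL pattern, page count and output path are placeholders, and the extraction of "payload.records" is only indicated in a comment:
import scala.io.Source
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("single-file-fetch").getOrCreate()
import spark.implicits._

// hypothetical: one URL per page, derived from the question's startIndex/limit logic
val urls = (0 until 10).map(i => s"http://example.com/api?start=${i * 100}&limit=100")

// fetch on the executors; keep the raw body here and extract "payload.records" as needed
val records = spark.sparkContext
  .parallelize(urls)
  .map(url => Source.fromURL(url).mkString)

// a single write at the end yields a single part file
records.toDF("records")
  .coalesce(1)
  .write
  .mode(SaveMode.Overwrite)
  .json("C:\\Users\\kh\\Desktop\\spark\\raw\\data\\final")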

how to save a value in a file using Scala

I am trying to save a value to a file, but I keep getting an error.
I have tried
.saveAsTextFile("/home/amel/timer")
REDUCER Function
val startReduce = System.currentTimeMillis()
val y = sc.textFile("/home/amel/10MB").filter(!_.contains("NULL")).filter(!_.contains("Null"))
val er = y.map(row => {
  val cols = row.split(",")
  (cols(1).split("-")(0) + "," + cols(2) + "," + cols(3), 1)
}).reduceByKey(_ + _).map(x => x._1 + "," + x._2)
er.collect.foreach(println)
val endReduce = System.currentTimeMillis()
val durationReduce = ((endReduce - startReduce) / 1000).saveAsTextFile("home/amel/timer/")
The error I'm receiving is on this line:
val durationReduce = ((endReduce-startReduce)/1000).saveAsTextFile("home/amel/timer/")
it says: saveAsTextFile is not a member of Long
The output I want is a number
Long does not have a method named saveAsTextFile. If you want to write a Long value there are many ways; a simple one is to use Java's PrintWriter:
val duration = (endReduce - startReduce) / 1000
new java.io.PrintWriter("ome/amel/timer/time") { write(duration.toString); close() }
If you still want to use Spark's RDD saveAsTextFile, then you can use
sc.parallelize(Seq(duration)).saveAsTextFile("path")
But this does not make much sense just to write a single value.
saveAsTextFile is a method on the class org.apache.spark.rdd.RDD (docs)
The expression ((endReduce - startReduce) / 1000) is of type Long, so it does not have this method, hence the error you are seeing: "saveAsTextFile is not a member of Long".
This answer is applicable here: https://stackoverflow.com/a/32105659/8261
Basically the situation is that you have an Int and you want to write it to a file. Your first thought is to create a distributed collection across a cluster of machines, that only contains this Int and let those machines write the Int to a set of files in a distributed way.
I'd argue this is not the right approach. Do not use Spark for saving an Int into a file. Instead you can use a PrintWriter:
val out = new java.io.PrintWriter("filename.txt")
out.println(finalvalue)
out.close()

What is happening here in Spark Code

val jsonString = sc.sequenceFile[Long, String](paths).map(x => {
  x._2
})
The paths variable holds a comma-separated list of locations of some text files.
What exactly is the code mapping here?
Can someone explain what is happening here? I am a bit confused.
sc.sequenceFile reads Hadoop Sequence Files, which are flat files storing key-value pairs (see https://wiki.apache.org/hadoop/SequenceFile). It produces an RDD[(Long, String)] - an RDD where each record is a key-value pair (of type Tuple2[Long, String]). The mapping then simply maps each pair to its value only (using Tuple2._2).
The resulting jsonString is an RDD[String].
The map operation can be replaced with a few equivalent calls that may or may not be clearer:
// Using the implicit PairRDDFunctions.values:
val jsonString = sc.sequenceFile[Long,String](paths).values
// Using map with anonymous argument:
val rdd = sc.parallelize(Seq((1, 2))).map(_._2)
// Using map with pattern matching:
val rdd = sc.parallelize(Seq((1, 2))).map {
case (key, value) => value
}

Spark: How to get String value while generating output file

I have two files
--------Student.csv---------
StudentId,City
101,NDLS
102,Mumbai
-------StudentDetails.csv---
StudentId,StudentName,Course
101,ABC,C001
102,XYZ,C002
Requirement
StudentId in the first file should be replaced with the StudentName and Course from the second file.
Once replaced I need to generate a new CSV with complete details like
ABC,C001,NDLS
XYZ,C002,Mumbai
Code used
val studentRDD = sc.textFile(file path);
val studentdetailsRDD = sc.textFile(file path);
val studentB = sc.broadcast(studentdetailsRDD.collect)

// Generating CSV
studentRDD.map { student =>
  val name = getName(student.StudentId)
  val course = getCourse(student.StudentId)
  Array(name, course, student.City)
}.mapPartitions { data =>
  val stringWriter = new StringWriter()
  val csvWriter = new CSVWriter(stringWriter)
  csvWriter.writeAll(data.toList)
  Iterator(stringWriter.toString())
}.saveAsTextFile(outputPath)

// Functions defined to get details
def getName(studentId: String) {
  studentB.value.map { stud => if (studentId == stud.StudentId) stud.StudentName }
}

def getCourse(studentId: String) {
  studentB.value.map { stud => if (studentId == stud.StudentId) stud.Course }
}
Problem
File gets generated but the values are object representations instead of String value.
How can I get the string values instead of objects ?
As suggested in another answer, Spark's DataFrame API is especially suitable for this, as it easily supports joining two DataFrames, and writing CSV files.
However, if you insist on staying with the RDD API, it looks like the main issue with your code is the lookup functions: getName and getCourse basically do nothing, because their return type is Unit (they are declared with procedure syntax, without =); using an if without an else also means that for some inputs there is no meaningful value to return.
To fix this, it's easier to get rid of them and simplify the lookup by broadcasting a Map:
// better to broadcast a Map instead of an Array; it makes lookups more efficient
val studentB = sc.broadcast(studentdetailsRDD.keyBy(_.StudentId).collectAsMap())

// convert to RDD[String] with the wanted formatting
val resultStrings = studentRDD.map { student =>
  val details = studentB.value(student.StudentId)
  Array(details.StudentName, details.Course, student.City)
}.map(_.mkString(",")) // naive CSV writing with no escaping etc.; you can also use CSVWriter like you did

// save as text file
resultStrings.saveAsTextFile(outputPath)
Spark has great support for joining and writing to files. A join takes only one line of code, and so does the write.
Hand-writing that code is error-prone, hard to read, and most likely much slower.
val df1 = Seq(
  (101, "NDLS"),
  (102, "Mumbai")
).toDF("id", "city")

val df2 = Seq(
  (101, "ABC", "C001"),
  (102, "XYZ", "C002")
).toDF("id", "name", "course")

val dfResult = df1.join(df2, "id").select("name", "course", "city")
dfResult.repartition(1).write.csv("hello.csv")
A directory will be created. There is only one file in that directory, which is the final result.
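If you prefer to start from the original CSV files rather than hard-coded data, here is a sketch along the same lines (assuming a SparkSession named spark; the file paths are placeholders and the column names must match the headers in your files):
// read both CSVs with their headers (paths are placeholders)
val students = spark.read.option("header", "true").csv("Student.csv")
val details = spark.read.option("header", "true").csv("StudentDetails.csv")

// join on StudentId and keep only the columns the requirement asks for
val result = students
  .join(details, "StudentId")
  .select("StudentName", "Course", "City")

result.repartition(1).write.csv("output")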

scala & Spark - ArrayBuffer does not append

I am new to Scala and Apache Spark and have been trying out some online examples.
I am using scala.collection.mutable.ArrayBuffer to store a list of tuples of the form (Int,Array[String]). I am creating an ArrayBuffer and then parsing a text file line by line and appending the required data from each line to the ArrayBuffer.
The code has no compilation errors. But when I access ArrayBuffer outside the block where I am appending it, I am not able to get the contents and the ArrayBuffer is always empty.
My code is below -
val conf = new SparkConf().setAppName("second")
val spark = new SparkContext(conf)
val file = spark.textFile("\\Desktop\\demo.txt")

var list = scala.collection.mutable.ArrayBuffer[(Int, Array[String])]()
var count = 0

file.map(_.split(","))
  .foreach { a =>
    count = countByValue(a) // returns an Int
    println("count is " + count) // showing correct output "count is 3"
    var t = (count, a)
    println("t is " + t) // showing correct output "t is (3,[Ljava.lang.String;#539f0af)"
    list += t
  }

println("list count is = " + list.length) // output "list count is = 0"
list.foreach(println) // no output
Can someone point out why this code isn't working?
Any help is greatly appreciated.
I assume spark is a SparkContext. In that case it is not surprising that the local list is not updated: only a copy of it is sent to the workers as part of the closure. If you need a mutable value within the foreach, you should use an Accumulator.
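If the goal is simply to get the (count, fields) pairs back on the driver, a minimal sketch that avoids driver-side mutation altogether (countByValue here stands for the question's own helper returning an Int, and it must be serializable so it can be shipped to the executors):
// build the pairs as part of the RDD transformation, then collect them to the driver
val pairs: Array[(Int, Array[String])] = file
  .map(_.split(","))
  .map(a => (countByValue(a), a))
  .collect()

println("list count is = " + pairs.length) // now reflects the number of input lines
pairs.foreach { case (count, fields) => println(count + " -> " + fields.mkString(",")) }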