Write data in a single file in Spark Scala - scala

I am trying to write data to a single file using Spark Scala:
while (loop > 0) {
  val getReq = new HttpGet("ww.url.com")
  val httpResponse = client.execute(getReq)
  val data = Source.fromInputStream(httpResponse.getEntity.getContent()).getLines.mkString
  val parser = JSON.parseFull(data)
  val globalMap = parser.get.asInstanceOf[Map[String, Any]]
  val reviewMap = globalMap.get("payload").get.asInstanceOf[Map[String, Any]]
  val df = context.sparkContext.parallelize(Seq(reviewMap.get("records").get.toString())).toDF()
  if (startIndex == 0) {
    df.coalesce(1).write.mode(SaveMode.Overwrite).json("C:\\Users\\kh\\Desktop\\spark\\raw\\data\\final")
  } else {
    df.coalesce(1).write.mode(SaveMode.Append).json("C:\\Users\\kh\\Desktop\\spark\\raw\\data\\final")
  }
  startIndex = startIndex + limit
  loop = loop - 1
  httpResponse.close()
}
The number of files created equals the number of loop iterations, and I want to create only one file.
It is also creating CRC files, which I want to remove.
I tried the config below, but it only stops the creation of _SUCCESS files:
.config("dfs.client.read.shortcircuit.skip.checksum", "true")
.config("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
.config("fs.file.impl.disable.cache", true)
Any ideas on how to create only one file, without CRC and _SUCCESS files?

Re: "The number of file created is the number of loops"
Even though you are using df.coalesce(1) in your code, it is still being executed as many number of times as you run the while loop.
I want to create one file only
From your code it seems that you are trying to invoke HTTP GET requests to some URL and save the content after parsing.
If this understanding is right then I believe you should not be using a while loop to do this task. There is map transformation that you could use in the following manner.
Please find below the Psuedo-Code for your reference.
val urls = List("a.com","b.com","c.com")
val sourcedf = sparkContext.parallelize(urls).toDF
//this could be map or flatMap based on your requirement.
val yourprocessedDF = sourcedf.map(<< do your parsing here and emit data>>)
yourprocessedDF.repartition(1).write(<<whichever format you need>>)
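Fleshing that out a bit, here is a hedged sketch of the same idea. It is not drop-in code: the paginated start/limit query parameters, the per-record HTTP client, and the name spark for your SparkSession (called context in your code) are all assumptions based on the question; "ww.url.com" is kept as the placeholder from the question.
import scala.io.Source
import scala.util.parsing.json.JSON
import org.apache.http.client.methods.HttpGet
import org.apache.http.impl.client.HttpClients
import org.apache.spark.sql.SaveMode
import spark.implicits._

// build one URL per page up front instead of mutating startIndex in a while loop
// (the start/limit parameters are assumed; loop and limit come from the question)
val urls = (0 until loop).map(i => s"ww.url.com?start=${i * limit}&limit=${limit}")

val recordsDF = spark.sparkContext
  .parallelize(urls)
  .map { url =>
    val client = HttpClients.createDefault() // one client per request; fine for a sketch
    val response = client.execute(new HttpGet(url))
    val body = Source.fromInputStream(response.getEntity.getContent).getLines.mkString
    response.close(); client.close()
    val payload = JSON.parseFull(body).get.asInstanceOf[Map[String, Any]]("payload").asInstanceOf[Map[String, Any]]
    payload("records").toString
  }
  .toDF("records")

// a single write call produces a single part file (plus Spark's usual metadata files)
recordsDF.coalesce(1).write.mode(SaveMode.Overwrite).json("C:\\Users\\kh\\Desktop\\spark\\raw\\data\\final")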

Related

how to save a value in a file using Scala

I am trying to save a value to a file, but keep getting an error. I have tried:
.saveAsTextFile("/home/amel/timer")
REDUCER function
val startReduce = System.currentTimeMillis()
val y = sc.textFile("/home/amel/10MB").filter(!_.contains("NULL")).filter(!_.contains("Null"))
val er = y.map(row => {
  val cols = row.split(",")
  (cols(1).split("-")(0) + "," + cols(2) + "," + cols(3), 1)
}).reduceByKey(_ + _).map(x => x._1 + "," + x._2)
er.collect.foreach(println)
val endReduce = System.currentTimeMillis()
val durationReduce = ((endReduce - startReduce) / 1000).saveAsTextFile("home/amel/timer/")
The error I'm receiving is on this line:
val durationReduce = ((endReduce - startReduce) / 1000).saveAsTextFile("home/amel/timer/")
It says: saveAsTextFile is not a member of Long.
The output I want is a single number.
Long does not have a method named saveAsTextFile. If you want to write a Long value there are many ways; a simple one is to use Java's PrintWriter:
val duration = (endReduce - startReduce) / 1000
new PrintWriter("/home/amel/timer/time") { write(duration.toString); close() }
If you still want to use Spark's RDD saveAsTextFile, then you can use:
sc.parallelize(Seq(duration)).saveAsTextFile("path")
But this does not make much sense just to write a single value.
saveAsTextFile is a method on the class org.apache.spark.rdd.RDD (docs).
The expression (endReduce - startReduce) / 1000 is of type Long, so it does not have this method, hence the error you are seeing: "saveAsTextFile is not a member of Long".
This answer is applicable here: https://stackoverflow.com/a/32105659/8261
Basically, the situation is that you have a single number and you want to write it to a file. Your first thought is to create a distributed collection across a cluster of machines that contains only this number, and to let those machines write it to a set of files in a distributed way.
I'd argue this is not the right approach. Do not use Spark to save a single number to a file. Instead, use a PrintWriter:
val out = new java.io.PrintWriter("filename.txt")
out.println(finalvalue)
out.close()
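If you are on Scala 2.13 or newer, scala.util.Using closes the writer for you even when the write throws. A minimal sketch (the path below is only an example):
import java.io.PrintWriter
import scala.util.Using

val duration = (endReduce - startReduce) / 1000
Using(new PrintWriter("/home/amel/timer/time")) { out =>
  out.println(duration) // the writer is closed automatically, even on failure
}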

Spark : How to get the latest file from s3 in the last 10 days

I am trying to get the latest file from S3 from the last 10 days when no file exists for the input date. The issue is that the path contains the date.
My path is like this :
val path = "s3://bucket-info/folder1/folder2"
val date = "2019/04/12" // YYYY/MM/DD
I am doing this:
val update_path = path + "/" + date // this becomes s3://bucket-info/folder1/folder2/2019/04/12
def fileExist(path: String, sc: SparkContext): Boolean =
  FileSystem.get(getS3OrFileUri(path), sc.hadoopConfiguration).exists(new Path(path + "/_SUCCESS"))
if (fileExist(update_path, sc)) {
  // read and process the file
} else {
  log("File not exist")
  // I need to get the latest file from the last 10 days and use it, i.e. check "s3://bucket-info/folder1/folder2/2019/04/11", "s3://bucket-info/folder1/folder2/2019/04/10" and so on. If there is no file in the last 10 days, throw an error.
}
But my issue is: how do I handle the date arithmetic when the range crosses the end of a month? I can do it in a for loop, but is there an optimized and elegant way to do this in Spark?
Not very optimal, but if you want to use Spark for it: the DataFrame reader can take multiple paths, and input_file_name() gives you the path of the file each row came from:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{desc, input_file_name}

val path = "s3://bucket-info/folder1/folder2"
val date = "2019/04/12"
val fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd")
val end = LocalDate.parse(date, fmt)
val prefixes = (0 until 10).map(end.minusDays(_)).map(d => s"$path/${fmt.format(d)}")

val prefix = spark.read
  .textFile(prefixes: _*)
  .select(input_file_name() as "file")
  .distinct()
  .orderBy(desc("file"))
  .limit(1)
  .collect()
  .collectFirst { case Row(prefix: String) => prefix }

prefix.fold {
  // log error
} { path =>
  // read and process the file
}
This is quite inefficient, and there is no clear way around that with Spark, since the S3 Hadoop file system implementation is not very efficient at recursive listings. If you are willing to use the S3 API directly, you could set s"$path/${fmt.format(end.minusDays(10))}" as the start-after parameter and list the keys from there. This works because S3 always returns key listings sorted lexicographically and the date keys are zero-padded.
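For reference, a rough sketch of that direct listing with the AWS SDK for Java v2 (the bucket/prefix split and the _SUCCESS filter are assumptions derived from the paths in the question, and pagination is ignored for brevity):
import scala.collection.JavaConverters._
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request

val s3 = S3Client.create()
val request = ListObjectsV2Request.builder()
  .bucket("bucket-info")
  .prefix("folder1/folder2/")
  .startAfter(s"folder1/folder2/${fmt.format(end.minusDays(10))}")
  .build()

// S3 returns keys in lexicographic order, so with zero-padded dates the last
// matching key belongs to the most recent partition
val latestSuccess = s3.listObjectsV2(request)
  .contents().asScala
  .map(_.key())
  .filter(_.endsWith("/_SUCCESS"))
  .lastOption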

How to process DataFrame data in parallel to call a URL with bulk params

I want to read a .csv file that contains player info. I have to get the country from this CSV and append it to a URL for further processing.
First I load the .csv data into a DataFrame, then I loop over it to append the nationality to the URL, as in the code below:
val inputDF = spark.read.format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load(getClass.getResource("/FifaData.csv").getPath)
  .toDF()

val baseUrl = "http://localhost:8080/countries/search?"
val nationalityDF = inputDF.select("Nationality").distinct.rdd.zipWithIndex()

nationalityDF.foreach { case (nationality, idx) =>
  val url = s"${baseUrl}page=${idx}&nameList=${nationality.get(0)}"
  println("url:: " + url)
}
I wonder if I can avoid the foreach and still process the data and call the URL?
Your implementation is already parallelised, so cheers!
To add more details:
foreach in Spark is an action that is used to perform operations with side effects. It operates on the RDD in the executor JVMs when Spark runs in cluster mode.
If you want to get rid of foreach altogether, you can translate it into a UDF and call it. However, this is not good practice because, based on your example, you are not looking to get any result back from the REST API. Caution: ugliness ahead.
import org.apache.spark.sql.functions.udf
import spark.implicits._

val inputDF = spark.read.format("csv").option("header", true).option("inferSchema", true)
  .load(getClass.getResource("/FifaData.csv").getPath)
val baseUrl = "http://localhost:8080/countries/search?"

val nationalityDF = inputDF.select("Nationality").distinct.rdd
  .zipWithIndex()
  .map { case (row, idx) => (row.getString(0), idx) }
  .toDF("nationality", "index")

// UDF called only for its side effect; it returns nothing useful
val callRestApi = udf { (nationality: String, idx: Long) =>
  val url = s"${baseUrl}page=${idx}&nameList=${nationality}"
  println("url:: " + url)
  null: String
}

nationalityDF.withColumn("placeHolder", callRestApi($"nationality", $"index")).drop("placeHolder")
// note: an action (e.g. .count()) is still needed for the UDF to actually run
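As a separate sketch (not part of the answer above): if you do want the responses back, mapPartitions is the usual place to set up one HTTP client per partition and return the results as an RDD. The column positions and the naive blocking GET via scala.io.Source below are assumptions for illustration:
import scala.io.Source

val responses = nationalityDF.rdd.mapPartitions { rows =>
  // a real implementation would create a reusable HTTP client here, once per partition
  rows.map { row =>
    val nationality = row.getString(0)
    val idx = row.getLong(1)
    val url = s"${baseUrl}page=${idx}&nameList=${nationality}"
    val body = Source.fromURL(url).mkString // naive blocking GET; swap in your preferred client
    (url, body)
  }
}

responses.take(5).foreach(println) // an action is needed before any request is actually sent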

Spark: How to get String value while generating output file

I have two files
--------Student.csv---------
StudentId,City
101,NDLS
102,Mumbai
-------StudentDetails.csv---
StudentId,StudentName,Course
101,ABC,C001
102,XYZ,C002
Requirement
The StudentId in the first file should be replaced with the StudentName and Course from the second file.
Once replaced, I need to generate a new CSV with the complete details, like:
ABC,C001,NDLS
XYZ,C002,Mumbai
Code used
val studentRDD = sc.textFile(file path)
val studentdetailsRDD = sc.textFile(file path)
val studentB = sc.broadcast(studentdetailsRDD.collect)

// Generating CSV
studentRDD.map { student =>
  val name = getName(student.StudentId)
  val course = getCourse(student.StudentId)
  Array(name, course, student.City)
}.mapPartitions { data =>
  val stringWriter = new StringWriter()
  val csvWriter = new CSVWriter(stringWriter)
  csvWriter.writeAll(data.toList)
  Iterator(stringWriter.toString())
}.saveAsTextFile(outputPath)

// Functions defined to get details
def getName(studentId: String) {
  studentB.value.map { stud => if (studentId == stud.StudentId) stud.StudentName }
}
def getCourse(studentId: String) {
  studentB.value.map { stud => if (studentId == stud.StudentId) stud.Course }
}
Problem
The file gets generated, but the values are object representations instead of String values.
How can I get the string values instead of objects?
As suggested in another answer, Spark's DataFrame API is especially suitable for this, as it easily supports joining two DataFrames, and writing CSV files.
However, if you insist on staying with the RDD API, it looks like the main issue with your code is the lookup functions: getName and getCourse basically do nothing, because their return type is Unit; using an if without an else means that for some inputs there is no return value, which makes the entire function return Unit.
To fix this, it's easier to get rid of them and simplify the lookup by broadcasting a Map:
// better to broadcast a Map instead of an Array; it makes lookups more efficient
val studentB = sc.broadcast(studentdetailsRDD.keyBy(_.StudentId).collectAsMap())

// convert to RDD[String] with the wanted formatting
val resultStrings = studentRDD
  .map { student =>
    val details = studentB.value(student.StudentId)
    Array(details.StudentName, details.Course, student.City)
  }
  .map(_.mkString(",")) // naive CSV writing with no escaping etc.; you can also use CSVWriter as you did

// save as text file
resultStrings.saveAsTextFile(outputPath)
Spark has great support for joining and for writing to files: the join takes one line of code and the write takes another.
Hand-writing that code is error-prone, hard to read, and most likely much slower.
val df1 = Seq(
  (101, "NDLS"),
  (102, "Mumbai")
).toDF("id", "city")

val df2 = Seq(
  (101, "ABC", "C001"),
  (102, "XYZ", "C002")
).toDF("id", "name", "course")

// join on id and keep the columns in the order the question asks for: name, course, city
val dfResult = df1.join(df2, "id").select("name", "course", "city")
dfResult.repartition(1).write.csv("hello.csv")
A directory will be created; it contains a single part file, which is the final result.
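Applied to the actual files from the question, a sketch could look like this (the file paths and the header option are assumptions):
val students = spark.read.option("header", true).csv("Student.csv")       // StudentId, City
val details = spark.read.option("header", true).csv("StudentDetails.csv") // StudentId, StudentName, Course

val result = students
  .join(details, "StudentId")
  .select("StudentName", "Course", "City") // -> ABC,C001,NDLS

result.repartition(1).write.csv("output")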

scala & Spark - ArrayBuffer does not append

I am new to Scala and Apache Spark and have been trying out some online examples.
I am using scala.collection.mutable.ArrayBuffer to store a list of tuples of the form (Int, Array[String]). I create an ArrayBuffer, then parse a text file line by line and append the required data from each line to the ArrayBuffer.
The code has no compilation errors, but when I access the ArrayBuffer outside the block where I append to it, I cannot get the contents and the ArrayBuffer is always empty.
My code is below -
val conf = new SparkConf().setAppName("second")
val spark = new SparkContext(conf)
val file = spark.textFile("\\Desktop\\demo.txt")

var list = scala.collection.mutable.ArrayBuffer[(Int, Array[String])]()
var count = 0

file.map(_.split(","))
  .foreach { a =>
    count = countByValue(a) // returns an Int
    println("count is " + count) // shows the correct output "count is 3"
    var t = (count, a)
    println("t is " + t) // shows the correct output "t is (3,[Ljava.lang.String;@539f0af)"
    list += t
  }

println("list count is = " + list.length) // output "list count is = 0"
list.foreach(println) // no output
Can someone point out why this code isn't working?
Any help is greatly appreciated.
I assume spark is a SparkContext. In that case it is not surprising that the local list is not updated: only a copy of it is sent to the executors as part of the foreach closure, and the appends happen to that copy. If you need to collect values from inside a foreach, you should use an accumulator, as sketched below.
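A minimal sketch of that accumulator route (assuming, as above, that spark is the SparkContext, and that countByValue is the helper from the question):
import scala.collection.JavaConverters._
import org.apache.spark.util.CollectionAccumulator

val acc: CollectionAccumulator[(Int, Array[String])] =
  spark.collectionAccumulator[(Int, Array[String])]("tuples")

file.map(_.split(","))
  .foreach { a =>
    acc.add((countByValue(a), a)) // runs on the executors; the driver merges the results
  }

val collected = acc.value.asScala // read the merged contents back on the driver
println("list count is = " + collected.size)
Because foreach is an action, each element is added exactly once here; inside transformations, accumulator updates can be re-applied when tasks are retried.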