Transformations and Actions in Apache Spark - scala

I have Scala code that takes multiple input files from HDFS using wildcards, and each file goes into a function where it is processed individually.
import de.l3s.boilerpipe.extractors.KeepEverythingExtractor

val data = sc.wholeTextFiles("hdfs://localhost:port/akshat/folder/*/*")
val files = data.map { case (filename, content) => filename }

def doSomething(file: String): (String, String) = {
  // logic of processing a single file comes here
  val logData = sc.textFile(file)
  val c = logData.toLocalIterator.mkString
  val d = KeepEverythingExtractor.INSTANCE.getText(c)
  val e = sc.parallelize(d.split("\n"))
  val recipeName = e.take(10).last
  val prepTime = e.take(18).last
  (recipeName, prepTime)
}
// How are transformations and actions applied here?
I am stuck on how to apply further transformations and actions so that all my input files are mapped through the function doSomething, and the output from all of the input files is stored in a single file using saveAsTextFile.

So if my understanding is correct, you have an RDD of pairs and you wish to transform it some more and then save the output for each key to a unique file. Transforming it some more is relatively easy: mapValues lets you write transformations on just the value, and any other transformation will also work on RDDs of pairs.
Saving the output to a unique file for each key, however, is a bit trickier. One option would be to find a Hadoop OutputFormat which does what you want and then use saveAsHadoopFile; another option would be to use foreach and write the code to output each key/value pair as desired.
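For the specific goal in the question, applying doSomething to every file and writing all results with saveAsTextFile, note that sc cannot be used inside a function that is applied within a transformation, so the processing has to work on the content that wholeTextFiles already provides. A minimal sketch, reusing the extraction logic from the question, with a hypothetical output path:
import de.l3s.boilerpipe.extractors.KeepEverythingExtractor

// Rework doSomething to operate on the file content itself, so no
// SparkContext is needed inside the closure.
def doSomething(content: String): (String, String) = {
  val text = KeepEverythingExtractor.INSTANCE.getText(content)
  val lines = text.split("\n")
  val recipeName = lines(9)  // 10th line, mirroring the original take(10).last
  val prepTime = lines(17)   // 18th line, mirroring the original take(18).last
  (recipeName, prepTime)
}

val data = sc.wholeTextFiles("hdfs://localhost:port/akshat/folder/*/*")
val results = data.map { case (filename, content) => doSomething(content) }

// coalesce(1) forces a single part file and is only sensible for small outputs.
results.coalesce(1)
  .map { case (name, time) => s"$name\t$time" }
  .saveAsTextFile("hdfs://localhost:port/akshat/output")  // hypothetical output path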

Related

Saving and Overwriting a file in Spark Scala

I have a text file where the first column contains a table name and the second column contains a date. The delimiter between the two columns is a space. The data is represented as follows:
employee.txt
organization 4-15-2018
employee 5-15-2018
My requirement is to read the file, update the date column based on the business logic, and save/overwrite the file. Below is my code:
object Employee {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local").setAppName("employeedata")
    val sc = new SparkContext(conf)
    var input = sc.textFile("D:\\employee\\employee.txt")
      .map(line => line.split(' '))
      .map(kvPair => (kvPair(0), kvPair(1)))
      .collectAsMap()
    // Do some operations
    // Do iteration and update the hashmap as follows
    val finalMap = input + (tableName -> updatedDate)
    sc.stop()
  }
}
How do I save/overwrite (if it exists) the finalMap in the above scenario?
My requirement is to read the file and update the date column based on the business logic and save/overwrite the file.
Never do something like this directly. Always:
1. Write the data to temporary storage first.
2. Delete the original using standard file system tools.
3. Rename the temporary output using standard file system tools.
An attempt to overwrite data directly will, with high probability, result in partial or complete data loss.
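A minimal sketch of that pattern, assuming the finalMap from the question and a hypothetical temporary path, using the Hadoop FileSystem API for the delete and rename steps (note that saveAsTextFile produces a directory of part files rather than a single file):
import org.apache.hadoop.fs.{FileSystem, Path}

val targetPath = "D:\\employee\\employee.txt"  // original location, from the question
val tempPath = "D:\\employee\\employee_tmp"    // hypothetical temporary location

// 1. Write the updated map to temporary storage first.
sc.parallelize(finalMap.toSeq)
  .map { case (tableName, date) => s"$tableName $date" }
  .saveAsTextFile(tempPath)

// 2. Delete the original, then 3. rename the temporary output into place.
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path(targetPath), true)
fs.rename(new Path(tempPath), new Path(targetPath))
Since finalMap is a small in-driver map in this example, writing it out with plain Java I/O would also work; the temp-then-rename pattern matters most when the data is written by Spark itself.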

Spark Multiple Output Paths result in Multiple Input Reads

First, apologies for the title; I wasn't sure how to describe this succinctly.
I have a Spark job that parses logs into JSON and then, using Spark SQL, converts specific columns to ORC and writes them to various paths. For example:
val logs = sc.textFile("s3://raw/logs")
val jsonRows = logs.mapPartitions(partition => {
  partition.map(log => {
    logToJson.parse(log)
  })
})

jsonRows.foreach(r => {
  val contentPath = "s3://content/events/"
  val userPath = "s3://users/events/"
  val contentDf = sqlSession.read.schema(contentSchema).json(r)
  val userDf = sqlSession.read.schema(userSchema).json(r)
  val userDfFiltered = userDf.select("*").where(userDf("type").isin("users"))
  // Save Data
  val contentWriter = contentDf.write.mode("append").format("orc")
  contentWriter.save(contentPath)
  val userWriter = userDf.write.mode("append").format("orc")
  userWriter.save(userPath)
})
When I wrote this I expected that the parsing would occur one time, and then it would write to the respective locations afterward. However, it seems that it is executing all of the code in the file twice - once for content and once for users. Is this expected? I would prefer that I don't end up transferring the data from S3 and parsing twice, as that is the largest bottleneck. I am attaching an image from the Spark UI to show the duplication of tasks for a single Streaming Window. Thanks for any help you can provide!
Okay, this kind of nested DataFrame usage is a no-go. DataFrames are meant to be a data structure for big datasets that won't fit into normal data structures (like Seq or List) and that need to be processed in a distributed way. A DataFrame is not just another kind of array. What you are attempting to do here is create a DataFrame per log line, which makes little sense.
As far as I can tell from the (incomplete) code you have posted here, you want to create two new DataFrames from your original input (the logs) which you then want to store in two different locations. Something like this:
val logs = sc.textFile("s3://raw/logs")
val contentPath = "s3://content/events/"
val userPath = "s3://users/events/"

val jsonRows = logs
  .mapPartitions(partition => {
    partition.map(log => logToJson.parse(log))
  })
  .toDF()
  .cache() // Or use persist() if the dataset is larger than will fit in memory

jsonRows
  .write
  .format("orc")
  .save(contentPath)

jsonRows
  .filter(col("type").isin("users"))
  .write
  .format("orc")
  .save(userPath)
The cache() is the important part for avoiding the double read: the parsed rows are materialized once, so the second write reuses them instead of reading and parsing the S3 input a second time. Hope this helps.

How to read many files and assign each file to the next variable?

I am a beginner in Scala and I have the following question:
How do I read more than one CSV file and assign each file to its own variable?
I know how to read one file:
val file_1 = sc.textFile("/Users/data/urls_20170225")
I also know how to read many files:
val file_2 = sc.textFile("/Users/data/urls_*")
But the second way assigns all the data to one variable, file_2, which is not what I want. I am looking for an elegant way to do this in Spark Scala.
Spark has no API to load multiple files into multiple RDDs. What you can do is load them one by one into a List of RDDs. Below is a sample code.
import java.io.File
import org.apache.spark.rdd.RDD

def main(arg: Array[String]): Unit = {
  val dir = """F:\Works\SO\Scala\src\main\resource"""
  val startsWith = """urls_""" // we will use this as the wildcard
  val fileList: List[File] = getListOfFiles(new File(dir))
  // 'spark' is an existing SparkSession
  val filesRDD: List[RDD[String]] = fileList.collect({
    case file: File if file.getName.startsWith(startsWith) => spark.sparkContext.textFile(file.getPath)
  })
}

// Get all the individual file paths
def getListOfFiles(dir: File): List[File] = dir.listFiles.filter(_.isFile).toList
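If you need to know which RDD came from which file, a small variation (a sketch; fileList, startsWith, and spark are the names assumed from the code above) keeps the file name alongside each RDD:
// Pair each file name with its own RDD so individual files can be looked up by name.
val rddByName: Map[String, RDD[String]] = fileList
  .filter(_.getName.startsWith(startsWith))
  .map(file => file.getName -> spark.sparkContext.textFile(file.getPath))
  .toMap

// Example usage, assuming a file named "urls_20170225" exists in the directory.
rddByName.get("urls_20170225").foreach(rdd => println(rdd.count()))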

Spark Scala - Process files in parallel and get dataframe for each file

I have thousands of large files to process.
I'm trying to read these files in parallel and then convert each of them to a DataFrame, so that I can aggregate the data and extract numerical features, etc.
I tried sc.wholeTextFiles(), which gives an RDD of (file name, content) tuples, but I'm not allowed to use sparkContext/sqlContext to create a DataFrame inside the RDD map.
val allFiles = sc.wholeTextFiles(inputDir)
val rows = allFiles.map {
  case (filename, content) => {
    val s = content.split("\n")
    // Convert the content to RDD and dataframe later
    //val r = sc.parallelize(s); <-- Serialization Error
  }
}
Accessing sc as above throws a serialization error, since we are not supposed to ship sc to tasks.
I also considered:
val input = sc.textFile(inputDir)
input.mapPartitions(iter => <process partition data>)
But with the above approach, when files are large they get split into several partitions and I lose the ability to process each file as a "whole".
So is there any other option where I can process whole files in parallel and convert each one to a DataFrame, etc.?
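One common pattern for this (a sketch only, not from the original thread; inputDir and the per-file feature logic are assumptions) is to do the per-file work with plain Scala collections inside the map and build a single DataFrame at the end, so sc is never used inside a task:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("whole-file-features").getOrCreate()
import spark.implicits._

// Hypothetical per-file feature extraction using ordinary Scala collections,
// so no SparkContext is needed inside the closure.
def extractFeatures(filename: String, content: String): (String, Long, Long) = {
  val lines = content.split("\n")
  (filename, lines.length.toLong, content.length.toLong)
}

// Each file is processed as a whole on some executor; the result is one row per file.
val featuresDf = spark.sparkContext.wholeTextFiles(inputDir)
  .map { case (filename, content) => extractFeatures(filename, content) }
  .toDF("filename", "lineCount", "charCount")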

How to get values from RDD in spark using scala

I want to read the contents of a zip file stored at a particular location, so I used the SparkContext.binaryFiles method as shown below:
val zipFileRDD = sc.binaryFiles("./handsOn/repo~apache~storm~14135470~false~Java~master~2210.zip")
zipFileRDD: org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)] = ./handsOn/repo~apache~storm~14135470~false~Java~master~2210.zip BinaryFileRDD[4] at binaryFiles at <console>:21
My question is:
How do I get the PortableDataStream instances from this RDD?
You can use the collect action: zipFileRDD.collect will return an Array[(String, PortableDataStream)]. But that's normally not what you actually want! If you then read files using these instances, you aren't actually using Spark's capabilities: everything happens in your driver program. Instead, apply map and other transformations so that different files get read on different workers.
If you just want the PortableDataStream outside of an RDD, then just run:
val zipFilePds = zipFileRDD.map(x => x._2).collect()
Using the Apache Commons Compress library, you can do something like this to get at the contents of the zip file (in this case the file listing):
import org.apache.commons.compress.archivers.zip.ZipArchiveInputStream

val zipFileListing = zipFileRDD
  .map(x => x._2.open())
  .map(x => { val y = new ZipArchiveInputStream(x); y.getNextEntry().getName() })
  .collect()
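If you want every entry name rather than just the first, a small extension (a sketch; it assumes the zipFileRDD from above) could look like this:
// List all entry names in every zip file of the RDD, not just the first one.
val allEntryNames = zipFileRDD.flatMap { case (_, pds) =>
  val zis = new ZipArchiveInputStream(pds.open())
  try {
    // getNextEntry() returns null once the archive is exhausted.
    Iterator.continually(zis.getNextEntry()).takeWhile(_ != null).map(_.getName).toList
  } finally {
    zis.close()
  }
}.collect()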