Saving and Overwriting a file in Spark Scala - scala

I have a text file where my first column is represented with table name and the second column is represented with date. The delimiter between two columns is represented by space. The data is represented as follows
employee.txt
organization 4-15-2018
employee 5-15-2018
My requirement is to read the file and update the date column based on the business logic and save/overwrite the file. Below is my code
object Employee {
def main(args: Array[String]) {
val conf = new SparkConf().setMaster("local").setAppName("employeedata")
val sc = new SparkContext(conf)
var input = sc.textFile("D:\\employee\\employee.txt")
.map(line => line.split(' '))
.map(kvPair => (kvPair(0), kvPair(1)))
.collectAsMap()
//Do some operations
// Do iteration and update the hashmap as follows
val finalMap = input + (tableName -> updatedDate)
sc.stop()
}
How to save/overwrite(if exists) the finalMap in the above scenario?

My requirement is to read the file and update the date column based on the business logic and save/overwrite the file.
Never do something like this directly. Always:
Write data to a temporary storage first.
Delete original using standard file system tools.
Rename temporary output using standard file system tools.
An attempt to overwrite data directly will, with high probability, result in a partial or complete data loss.

Related

Spark Scala - textFile() and sequenceFile() RDDs

I'm successfully loading my sequence files into a DataFrame with some code like this:
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sc.sequenceFile[LongWritable,String](src)
val jsonRecs = file.map((record: (String, String)) => new String(record._2))
val df = sqlContext.read.json(jsonRecs)
I'd like to do the same with some text files. The text files have a similar format as the sequence files (A timestamp, a tab char, then the json). But the problem is textFile() returns an RDD[String] instead of an RDD[LongWritable,String] like the sequenceFile() method.
My goal is to be able to test the program with either sequence files or text files as input.
How could I convert the RDD[String] coming from textFile() into an RDD[LongWritable,String]? Or is there a better solution?
Assuming that your text file is a CSV file, you can use following code for reading a CSV file in a Dataframe where spark is the SparkSession:
val df = spark.read.option("header", "false").csv("file.txt")
Like header option there are multiple options you can provide depending upon your requirement. Check this for more details.
Thanks for the responses. It's not a CSV but I guess it could be. It's just the text output of doing this on a sequence file in HDFS:
hdfs dfs -text /path/to/my/file > myFile.txt
Anyway, I found a solution that works for both sequence and text file for my use case. This code ends up setting the variable 'file' to a RDD[String,String] in both cases, and I can work with that.
var file = if (inputType.equalsIgnoreCase("text")) {
sc.textFile(src).map(line => (line.split("\t")(0), line.split("\t")(1)))
} else { // Default to assuming sequence files are input
sc.sequenceFile[String,String](src)
}

Spark-shell running a SELECT for a dataframe

I've created a spark scala script to load a file with customers information. Then I have created a case class to map the records and show them up as a table, my script below:
//spark context
sc
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
//Define the class to map customers coming from the data inputh
case class customer (cusid: Int, name: String, city : String, province: String, postalcode: String)
//load the file info
val customer_file = sc.textFile("file:////home/ingenieroandresangel/scalascripts/customer.txt")
val customer_rdd = customer_file.map(_.split(",")).map(p => customer(p(0).toInt,p(1),p(2),p(3),p(4)))
val cusstomerdf = customer_rdd.toDF()
current results:
Now, I need to perform spark sql queries to get back just a column coming from my dataframe, example the column name:
print(cusstomerdf.select("name"))
Nevertheless, the results are not as expected. I need to get back the rows for the column name but instead, I get this result:
Question: How should I run the right select to get back just the column name on my dataframe?? thanks
The result is correct. You are doing a transformation only as select is a transformation.
If you save it in a parquet file or a csv file you will see the result and can confirm that the column is already selected.
Meanwhile you can see the result on the screen by doing
val selecteddf = customerdf.select("name")
selecteddf.show(false)
which will show the 20 rows of name column

Issue in inserting data to Hive Table using Spark and Scala

I am new to Spark. Here is something I wanna do.
I have created two data streams; first one reads data from text file and register it as a temptable using hivecontext. The other one continuously gets RDDs from Kafka and for each RDD, it it creates data streams and register the contents as temptable. Finally I join these two temp tables on a key to get final result set. I want to insert that result set in a hive table. But I am out of ideas. Tried to follow some exmples but that only create a table with one column in hive and that too not readable. Could you please show me how to insert results in a particular database and table of hive. Please note that I can see the results of join using show function so the real challenge lies with insertion in hive table.
Below is the code I am using.
imports.....
object MSCCDRFilter {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Flume, Kafka and Spark MSC CDRs Manipulation")
val sc = new SparkContext(sparkConf)
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val cgiDF = sc.textFile("file:///tmp/omer-learning/spark/dim_cells.txt").map(_.split(",")).map(p => CGIList(p(0).trim, p(1).trim, p(2).trim,p(3).trim)).toDF()
cgiDF.registerTempTable("my_cgi_list")
val CGITable=sqlContext.sql("select *"+
" from my_cgi_list")
CGITable.show() // this CGITable is a structure I defined in the project
val streamingContext = new StreamingContext(sc, Seconds(10)
val zkQuorum="hadoopserver:2181"
val topics=Map[String, Int]("FlumeToKafka"->1)
val messages: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(streamingContext,zkQuorum,"myGroup",topics)
val logLinesDStream = messages.map(_._2) //获取数据
logLinesDStream.print()
val MSCCDRDStream = logLinesDStream.map(MSC_KPI.parseLogLine) // change MSC_KPI to MCSCDR_GO if you wanna change the class
// MSCCDR_GO and MSC_KPI are structures defined in the project
MSCCDRDStream.foreachRDD(MSCCDR => {
println("+++++++++++++++++++++NEW RDD ="+ MSCCDR.count())
if (MSCCDR.count() == 0) {
println("==================No logs received in this time interval=================")
} else {
val dataf=sqlContext.createDataFrame(MSCCDR)
dataf.registerTempTable("hive_msc")
cgiDF.registerTempTable("my_cgi_list")
val sqlquery=sqlContext.sql("select a.cdr_type,a.CGI,a.cdr_time, a.mins_int, b.Lat, b.Long,b.SiteID from hive_msc a left join my_cgi_list b"
+" on a.CGI=b.CGI")
sqlquery.show()
sqlContext.sql("SET hive.exec.dynamic.partition = true;")
sqlContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict;")
sqlquery.write.mode("append").partitionBy("CGI").saveAsTable("omeralvi.msc_data")
val FilteredCDR = sqlContext.sql("select p.*, q.* " +
" from MSCCDRFiltered p left join my_cgi_list q " +
"on p.CGI=q.CGI ")
println("======================print result =================")
FilteredCDR.show()
streamingContext.start()
streamingContext.awaitTermination()
}
}
I have had some success writing to Hive, using the following:
dataFrame
.coalesce(n)
.write
.format("orc")
.options(Map("path" -> savePath))
.mode(SaveMode.Append)
.saveAsTable(fullTableName)
Our attempts to use partitions weren't followed through with, because I think there was some issue with our desired partitioning column.
The only limitation is with concurrent writes, where the table does not exist yet, then any task tries to create the table (because it didn't exist when it first attempted to write to the table) will Exception out.
Be aware, that writing to Hive in streaming applications is usually bad design, as you will often write many small files, which is very inefficient to read and store. So if you write more often than every hour or so to Hive, you should make sure you include logic for compaction, or add an intermediate storage layer more suited to transactional data.

How to read many files and assign each file to the next variable?

I am beginner in Scala, I have following question:
How to read more that one csv file and and assign each file to the next variable?
I know how the read one file:
val file_1.sc.textFile("/Users/data/urls_20170225")
I also know how to read many files:
val file_2.sc.textFile("/Users/data/urls_*")
But second way assign all data to one variables file_2, is something that I don't want to! I am looking for elegant way to do this in Spark Scala.
spark has no API to load multiple files into multiple RDD. What you can do is load them one by one into one List of RDD. Below is a sample code.
def main(arg: Array[String]): Unit = {
val dir = """F:\Works\SO\Scala\src\main\resource"""
val startsWith = """urls_""" // we will use this as the wildcard
val fileList:List[File] = getListOfFiles(new File(dir))
val filesRDD: List[RDD[String]] = fileList.collect({
case file: File if file.getName.startsWith(startsWith)=> spark.sparkContext.textFile(file.getPath)
})
}
//Get all the individual file paths
def getListOfFiles(dir: File):List[File] = dir.listFiles.filter(_.isFile).toList

Transformations and Actions in Apache Spark

I have scala code that takes multiple input files from HDFS using wildcards and each files goes into a function where processing takes place for each file individually.
import de.l3s.boilerpipe.extractors.KeepEverythingExtractor
val data = sc.wholeTextFiles("hdfs://localhost:port/akshat/folder/*/*")
val files = data.map { case (filename, content) => filename}
def doSomething(file: String): (String,String) = {
// logic of processing a single file comes here
val logData = sc.textFile(file);
val c = logData.toLocalIterator.mkString
val d = KeepEverythingExtractor.INSTANCE.getText(c)
val e = sc.parallelize(d.split("\n"))
val recipeName = e.take(10).last
val prepTime = e.take(18).last
(recipeName,prepTime)
}
//How transformation and action applied here?
I am stuck at how to to apply further transformations and actions so that all my input files are mapped according to function doSomething and all the output from each of the input files is stored in a single file using saveAsTextFile.
So if my understanding is correct, you have an RDD of Pairs and you wish to transform it some more, and then save the output for each key to a unique file. Transforming it some more is relatively easy, mapValue will allow you to write transformations on just the value, as well any other transformation will work on RDDs of Pairs.
Saving the output to a unique file for each key however, is a bit trickier. One option would be to try and find a hadoopoutput format which does what you want and then use saveAsHadoopFile, another option would be to use foreach and then just write the code to output each key/value pair as desired.