How to write a constant value to a Parquet file using Scala?

I'm using Spark. I want to save the value 2.484, repeated 13849 times, to a Parquet file instead of printing it to the console.
implicit class Rep(n: Int) {
  def times[A](f: => A): Seq[A] = { 1 to n map (_ => f) }
}

val myHis = 13849.times { println("2.4848911616270923") }
This code repeats the value 2.484, but only prints it to the console.
How can I save it to a Parquet file instead?

Try this:

import org.apache.spark.sql._

val myHis: Dataset[String] = ???
val path: String = ??? // the path you want
myHis.write.mode(SaveMode.Overwrite).parquet(path)
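For the original question of writing a repeated constant, here is a minimal, hedged sketch of one way to fill in the ??? placeholders; the object name, app name, use of Seq.fill, and the output path are my own assumptions, not part of the answer above.

import org.apache.spark.sql.{SaveMode, SparkSession}

object WriteConstantParquet {
  def main(args: Array[String]): Unit = {
    // Assumed local SparkSession; adjust master/appName for a real cluster.
    val spark = SparkSession.builder().appName("write-constant").master("local[*]").getOrCreate()
    import spark.implicits._

    // Repeat the constant 13849 times and turn it into a one-column Dataset[Double].
    val myHis = Seq.fill(13849)(2.4848911616270923).toDS()

    // Write the values to Parquet instead of printing them to the console.
    myHis.write.mode(SaveMode.Overwrite).parquet("/tmp/constant_parquet") // hypothetical path

    spark.stop()
  }
}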

Related

not able to store result in hdfs when code runs for second iteration

Well, I am new to Spark and Scala and have been trying to implement data cleaning in Spark. The code below checks for missing values in one column, stores them in outputrdd, and loops to calculate each missing value. The code works well when there is only one missing value in the file, but since HDFS does not allow writing to the same location again, it fails when there is more than one missing value. Can you please assist in writing finalrdd to a particular location once the missing values for all occurrences have been calculated?
def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("app").setMaster("local")
  val sc = new SparkContext(conf)
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val files = sc.wholeTextFiles("/input/raw_files/")
  val file = files.map { case (filename, content) => filename }
  file.collect.foreach(filename => {
    cleaningData(filename)
  })

  def cleaningData(file: String) = {
    // headers has the column headers of the files
    var hdr = headers.toString()
    var vl = hdr.split("\t")
    sqlContext.clearCache()
    if (hdr.contains("COLUMN_HEADER")) {
      // Checks for missing values in the dataframe and stores them in outputrdd
      if (!outputrdd.isEmpty()) {
        logger.info("value is zero then performing further operation")
        val outputdatetimedf = sqlContext.sql("select date,'/t',time from cpc where kwh = 0")
        val outputdatetimerdd = outputdatetimedf.rdd
        val strings = outputdatetimerdd.map(row => row.mkString).collect()
        for (i <- strings) {
          if (Coddition check) {
            // Calculates missing value and stores it in finalrdd
            finalrdd.map { x => x.mkString("\t") }.saveAsTextFile("/output")
            logger.info("file is written in file")
          }
        }
      }
    }
  }
}
It is not clear how (Coddition check) works in your example.
In any case, .saveAsTextFile("/output") should be called only once, so I would rewrite your example like this:
// keep this as an RDD (no .collect()), so the result can still be saved to HDFS
val strings = outputdatetimerdd
  .map(row => row.mkString("\t"))

val finalrdd = strings
  .filter(str => Coddition check str) // don't know how this Coddition works

// this part is called only once, not in a loop
finalrdd.saveAsTextFile("/output")
logger.info("file is written in file")
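As a hedged sketch of the same "write once" idea applied across several input files (the names writeOnce and perFileResults are mine, not from the question): accumulate one result RDD per file and save their union a single time, so the output path is only created once.

import org.apache.spark.rdd.RDD

// Sketch: merge the result RDD of every file iteration and save once,
// so HDFS never sees a second write to the same location.
def writeOnce(perFileResults: Seq[RDD[String]], outputPath: String): Unit = {
  if (perFileResults.nonEmpty) {
    val combined = perFileResults.reduce(_ union _) // union of all iterations
    combined.saveAsTextFile(outputPath)             // called exactly once
  }
}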

Scala: how to define objectFile[Type](path)

I want to read HDFS data, but the data may have been saved from an RDD[(String,Int,SparseVector)] or an RDD[(Int,String,Int)], etc.
So I want to pass command-line parameters like "String,Int,SparseVector" to my job via spark-submit.
How can I get the type for objectFile[type] from the command-line arguments?
object Test2 {
  def main(args: Array[String]): Unit = {
    val conf = new org.apache.spark.SparkConf()
    val sc = new org.apache.spark.SparkContext(conf)
    var fmt = "Int,String,SparseVector"
    if (args.size != 0) { fmt = args(0) }
    var fmt_arr = fmt.split(",")
    type data_type = (matchClass(fmt_arr(0)), matchClass(fmt_arr(1)), matchClass(fmt_arr(2)))
    val data = sc.objectFile[data_type]("")
  }

  def matchClass(str: String) = {
    str match {
      case "String" => String
      case "Int" => Int
      case "SparseVector" => SparseVector
      case _ => throw new RuntimeException("unsupported type")
    }
  }
}
You could keep all the configuration entries in a so-called application.conf file:
https://github.com/lightbend/config
You could then read this config file when you do the spark-submit. Have a look here for an example of how to load an application.conf file into your application; the mechanism should be the same for a Spark application:
https://github.com/joesan/plant-simulator/blob/master/app/com/inland24/plantsim/config/AppConfig.scala
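Since Scala element types must be known at compile time, a common workaround is to match on the format string and call objectFile with a concrete type parameter in each branch. A minimal, hedged sketch, assuming the two tuple shapes mentioned in the question (the loadObjectFile name is mine):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.rdd.RDD

// Sketch: dispatch on the format string; each branch supplies a concrete
// type parameter, because objectFile[T] cannot take a runtime type.
def loadObjectFile(sc: SparkContext, fmt: String, path: String): RDD[_] = fmt match {
  case "String,Int,SparseVector" => sc.objectFile[(String, Int, SparseVector)](path)
  case "Int,String,Int"          => sc.objectFile[(Int, String, Int)](path)
  case other                     => throw new IllegalArgumentException(s"unsupported format: $other")
}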

Why does a Spark SQL UDF return a DataFrame with column names in the format UDF("Original Column Name")?

The DataFrame I get after running the following code is exactly how I want it to be. It is the same DataFrame as the original, but all cells with purely numeric data have had all brackets and slashes removed (brackets are replaced with a minus sign at the front).
stringModifierIterator takes a DataFrame and returns a List[Column], which can then be used as in dataframe.select(columnList: _*) to create a new DataFrame.
Unfortunately, the column names have been altered to something like UDF("Original Column Name") and I can't figure out why.
def stringModifierIterator(dataFrame: DataFrame, dataFrameColumns: Array[String], uDF: UserDefinedFunction): List[Column] = {
  if (dataFrameColumns.isEmpty) {
    Nil
  } else {
    uDF(dataFrame(dataFrameColumns.head)) :: stringModifierIterator(dataFrame, dataFrameColumns.tail, uDF)
  }
}

val stringModifierFunction: (String => String) = { s: String => Option(s).map(modifier).getOrElse("0") }

def modifier(inputString: String): String = {
  ???
}
This is what the column names look like when I use df.show()
You can solve this by explicitly naming the columns you create with the UDF in stringModifierIterator using Column.as:
def stringModifierIterator(dataFrame: DataFrame, dataFrameColumns: Array[String], uDF: UserDefinedFunction): List[Column] = {
  if (dataFrameColumns.isEmpty) {
    Nil
  } else {
    val col = dataFrameColumns.head
    uDF(dataFrame(col)).as(col) :: stringModifierIterator(dataFrame, dataFrameColumns.tail, uDF)
  }
}
BTW, this method can be much shorter and simpler without recursion:
def stringModifierIterator(dataFrame: DataFrame, dataFrameColumns: Array[String], uDF: UserDefinedFunction): List[Column] = {
  dataFrameColumns.toList.map(col => uDF(dataFrame(col)).as(col))
}
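For completeness, a hedged usage sketch (the toy data and the simplified modifier are my own, not the asker's): with the corrected stringModifierIterator above in scope, the selected columns keep their original names.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("udf-naming").master("local[*]").getOrCreate()
import spark.implicits._

// Toy DataFrame and an illustrative modifier that just strips brackets and slashes.
val df = Seq(("(12)", "a/b"), ("34", "c")).toDF("amount", "label")
val modifierUDF = udf((s: String) => Option(s).map(_.replaceAll("[()/]", "")).getOrElse("0"))

// The aliased columns keep the names "amount" and "label" instead of UDF(amount), UDF(label).
val cleaned = df.select(stringModifierIterator(df, df.columns, modifierUDF): _*)
cleaned.show()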

scala/spark: Read in RDD[(String,Int)]

I have the following text file (previously output from an RDD[(String,Int)] )
(ARCHITECTURE,50)
(BUSINESS,17)
(CHEMICAL ENGINEERING,6)
(CHILD DEVELOPMENT,43)
(CIVIL ENGINEERING,26)
etc
I can read in as RDD[String] like this:
spark.sparkContext.textFile(path + s"$path\\${fileName}_labelNames")
But how can I read in as RDD[String,Int]? Is it possible?
EDITED:
Fixed error in RDD type above
There is no RDD[String, Int]; that type is illegal.
Maybe what you mean is RDD[(String, Int)].
Here is how you can transform it from the original data:
val data = original.map { record =>
  val a = record.stripPrefix("(").stripSuffix(")").split(",")
  val k = a(0)
  val v = a(1).toInt
  (k, v)
}

Here the original variable is of type RDD[String], as read from the source.
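A hedged end-to-end sketch (the path and SparkSession setup are my assumptions): read the text file and apply the same parsing to get an RDD[(String, Int)].

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-pairs").master("local[*]").getOrCreate()

// Read the lines, then parse each "(KEY,COUNT)" record into a (String, Int) pair.
val original = spark.sparkContext.textFile("/tmp/labelNames") // hypothetical path
val data = original.map { record =>
  val a = record.stripPrefix("(").stripSuffix(")").split(",")
  (a(0), a(1).toInt)
}
data.take(5).foreach(println) // e.g. (ARCHITECTURE,50)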

How do I save a file in a Spark PairRDD using the key as the filename and the value as the contents?

In Spark, I have downloaded multiple files from S3 using sc.binaryFiles. The resulting RDD has the filename as the key and the contents of the file as the value. I have decompressed the file contents, parsed them as CSV, and converted them to a DataFrame, so I now have a PairRDD[String, DataFrame]. The problem is that I want to save each value to HDFS as a Parquet file, using the key as the filename and overwriting the file if it already exists. This is what I have so far.
val files = sc.binaryFiles(lFiles.mkString(","), 250).mapValues(stream => sc.parallelize(readZipStream(new ZipInputStream(stream.open))))

val tables = files.mapValues(file => {
  val header = file.first.split(",")
  val schema = StructType(header.map(fieldName => StructField(fieldName, StringType, true)))
  val lines = file.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }.flatMap(x => x.split("\n"))
  val rowRDD = lines.map(x => Row.fromSeq(x.split(",")))
  sqlContext.createDataFrame(rowRDD, schema)
})
If you have any advice, please let me know. I would appreciate it.
Thanks,
Ben
The way to save files to HDFS in Spark is the same as in Hadoop, so you need to create a class that extends MultipleTextOutputFormat; in the custom class you can define the output filename yourself. An example is below:
class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = {
    "realtime-" + new SimpleDateFormat("yyyyMMddHHmm").format(new Date()) + "00-" + name
  }
}
The calling code is below:
RDD.rddToPairRDDFunctions(rdd.map { case (key, list) =>
  (NullWritable.get, key)
}).saveAsHadoopFile(input, classOf[NullWritable], classOf[String], classOf[RDDMultipleTextOutputFormat])
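To use the pair's key as the actual file name, as the question asks, one hedged variation of the answer's approach is to return the key from generateFileNameForKeyValue and suppress the key in the file contents. The class name, app name, output path, and sample pairs below are my own assumptions.

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: one output file per key, named after the key, containing only the value.
class KeyAsFileNameOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString                     // the key becomes the file name
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get                 // write only the value into the file
}

val sc = new SparkContext(new SparkConf().setAppName("key-as-filename").setMaster("local[*]"))

// Hypothetical (fileName, contents) pairs; in the question they would come from sc.binaryFiles.
val pairs = sc.parallelize(Seq(("fileA", "contents of A"), ("fileB", "contents of B")))
pairs.saveAsHadoopFile("/tmp/output", classOf[String], classOf[String], classOf[KeyAsFileNameOutputFormat])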